# Wolfram Rollett · Hannah Bijlsma · Sebastian Röhl Editors

# Student Feedback on Teaching in Schools

Using Student Perceptions for the Development of Teaching and Teachers


*Editors*

Wolfram Rollett
Institute for Educational Sciences, University of Education, Freiburg, Germany

Sebastian Röhl
Institute for Educational Sciences, University of Education, Freiburg, Germany

Hannah Bijlsma
Section of Teacher Professionalization, University of Twente, Enschede, The Netherlands

ISBN 978-3-030-75149-4 ISBN 978-3-030-75150-0 (eBook) https://doi.org/10.1007/978-3-030-75150-0

© The Editor(s) (if applicable) and The Author(s) 2021. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

# **Foreword**

Feedback is a hot topic—with so many studies, and now over 23 meta-analyses on the effect of feedback on achievement (and many more on other outcomes and in other disciplines). Most of this research is premised on feedback from teachers to students, whereas a critical missing link is the measurement, quality, and impact of feedback from students to teachers. This is a well-rehearsed topic in tertiary classes, but far less so in K-12. This book begins to add foundations to the debates about feedback to teachers in elementary, secondary, and high schools.

When I published *Visible Learning* in 2009, the average effect size of feedback to students was 0.78, and of feedback to teachers 0.50 (nearly all of the latter studies coming from university students). We recently revisited the meta-analyses, located almost every study in the 23 meta-analyses, and recalculated the individual and overall effects: the overall effect on students has dropped to 0.48 (Wisniewski et al. 2020; also see Chap. 8). One major reason for this decrease is the hype around feedback: the misplaced emphasis on increasing the quantity of feedback, and the ignoring of its massive variability. The variability is core to understanding the effect. This was seen in the major synthesis by Kluger and DeNisi (1996), who observed that about one-third of feedback effects are negative, and who were careful to note that the search for these moderators is core. The same feedback may work for me but not for you; the same feedback to me may work today but not tomorrow. Understanding this variability is the core, and this is so often forgotten.

The search for these critical moderators has been underway and gathering pace for many years. From my research on feedback on learning, there are at least five: feedback is maximized when it contains "where to next"/improvement-focused information; when it is aligned with the instructional cycle (about the task, the process, self-regulation); when praise is not allowed to dilute the effects (as students focus on and recall the praise over the information); when we attend to the reception of feedback rather than overly focusing on the giving; and, most critically, the effect on students is higher when teachers demonstrate that they are willing to receive feedback about their impact.

Like students, teachers need to hear, understand, and act on the feedback they receive. Some teachers are impervious to feedback, thinking that their task is to "give" feedback to students, not receive it themselves; some use the many cognitive biases that make us human to reinterpret student feedback (such as confirmation bias: good feedback is about me, negative feedback is about the students); some are extremely good at selectively listening to student feedback; some dismiss student feedback as ill-informed and the whims of youngsters; and some collect the feedback too late, so it has little impact on improving the teaching for those students. In my own case as a university professor, I have read student evaluations of my teaching for many decades, and if they do not say "He speaks too fast," I question the validity of the other student responses. But the proper question should be: why have I not improved my speaking skills? Such confirmation bias, dismissal of the value of student responses, and seeing evaluations as more worthwhile for promotion than for improvement mean that I and my students are the losers.

It does seem ironic that teachers will listen to feedback from external adult observers who come into their classes and conduct fleeting observations of their teaching. This feedback from observers is more often about *how* they teach, and usually not about the impact of their teaching on the students. There is a corpus of research, noted in many chapters, on the major issues with the reliability of these external observations. Yet there is already a plethora of studies showing the high reliability of student feedback to teachers, which is then often dismissed: students could not possibly have worthwhile information, are biased (as they do not know good teaching!), and are recipients, not informers, of teaching. Why do we (a) treat unreliability as a major question for observational methods, but (b) shift the blame to other factors once student feedback has been shown to meet these reliability criteria? This book reviews the evidence on how to maximize the reliability and validity of student feedback, outlines many of the more dependable measures, and invites deeper discussion of the informational value that can be derived from students' feedback to teachers.

Students are not, however, "consumers"; they are learners, and learning is often messy. Failure, therefore, needs to be a learner's best friend and welcomed by the teacher as an opportunity to learn. Hearing from the students about their struggles, what they do not know, their ways of thinking about the content, their conceptions and misconceptions—surely this is the food for more effective teaching. Teachers who are impervious to this feedback are more likely to attribute failure to the students and devise explanations for why the students cannot learn (they come from poor backgrounds or unsupportive families, are not well enough prepared for my class, do not pay attention, are disruptive, have fixed mindsets, and so on)—when they themselves are the only people in the room paid to improve.

The editors set the scene by outlining the Process Model of Student Feedback on Teaching (SFT; Chap. 1 in this volume by Röhl, Bijlsma, and Rollett), which deals with ensuring a comprehensive, reliable, and interpretable collection of data, but also places major emphasis on how student feedback is understood and interpreted by the teacher. There is much emphasis on the teacher as receiver of the feedback, investigated from cognitive as well as affective perspectives. There is then much debate about the many instruments, their dimensionality, construction, and measurement properties. Since the move in the 1990s to viewing validity through the lens of whether test information is correctly interpreted and has consequential impact, we do need to know more about the quality of the reports.

Van der Lans (Chap. 5) asks a core question about two teachers, John and Tess. Given that we have collected information from students about these teachers, what can we advise Tess and John? This is relatively undeveloped territory, and could account for much of the dismissal of student feedback: if we do not intend to use the information to improve, then what is its value, and why collect it? This question invites the next generation of researchers to devote more attention to the consequences of asking for student evaluation. When we give feedback to students, it is more often to improve their learning, and we are most creative and effective in framing feedback to engender improvement. Ditto when we receive feedback from students. We are not there yet.

It is thus fascinating to find that the only study characteristic that moderated the effects in Röhl's (Chap. 9) meta-analysis of the longitudinal effects of receiving feedback from students is the level of support given to the teacher. Treatments with a high level of individual support for reflecting on feedback and teaching development showed a significantly higher effect size (*d* = 0.52) than studies with a medium (*d* = -0.06) or low (*d* = 0.16) level of support. That is, feedback to teachers only made sense when there was support for subsequent teaching development, with ongoing advice on the development process through individual or group consultations, counselling, or professional learning communities (also noted in Fleenor, Chap. 14); that is, in the presence of greater professional learning about the consequences of, and actions arising from, the feedback.

One of my concerns about many of the current student evaluation tools is that they focus on particular ways to teach, and then assume that if you teach in these ways there will be positive effects on students. Once again, this confuses correlation with causation. Most of the instruments outlined in these chapters include too few items asking for feedback about the students' learning and the impact teachers are having on students. We know, for example, that if a student believes the teacher has no credibility, then the teacher is unlikely to have much impact—even if they are using all the desirable teaching strategies with great classroom climates. This is a call for more about students' conceptions of what it means to be a learner, whether the class is an inviting place to come and learn, and students' conceptions of their learning. Such feedback to teachers could be among the most powerful for improving teaching quality.

It seems a cogent discovery that much of the use of student evaluations, and probably their subsequent impact on improving the quality of teaching, starts with a high sense of teacher self-efficacy. Röhl and Gärtner (Chap. 10) note that this impact rests on the teacher's attitude towards considering students as trustworthy and competent feedback providers. So often I have seen teachers dismiss student evaluation, whereas building confidence in its informational value, and in the benefits to the teacher of learning how to improve, may be a critical first step; this once again highlights the value of reports that help the teacher interpret and act on the feedback. Also note the critical comments by Schweig and Martínez (Chap. 6) about the information in the variances. I was taught by my mentor, Rod McDonald (a famous psychometrician), that the answers often lie in the residuals—the detail is in the variances. No teacher would proclaim that all students think and learn alike—so attending to the variances (often missing in some current measures and reports) seems well worth pursuing. When you see variance, it is a great moment to triangulate with other information, so that any deviations or surprises can be more closely investigated.

This book provides a line in the sand. It reviews what we know about student feedback to teachers, it makes the powerful point that most teachers have positive attitudes towards receiving such feedback (see Göbel et al. in Chap. 11), outlines many of the measures and measurement issues, and raises the more important questions still to be resolved. This makes this book timely. It is detailed and it is a pleasure to read. To have these chapters in one place—and from those most up to date with the research literature and doing the research—is a gift.

> John Hattie
> Melbourne Graduate School of Education
> Melbourne, Australia

# **References**

Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. *Psychological Bulletin, 119*(2), 254–284. https://doi.org/10.1037/0033-2909.119.2.254.

Wisniewski, B., Zierer, K., & Hattie, J. (2020). The power of feedback revisited: A meta-analysis of educational feedback research. *Frontiers in Psychology, 10,* 3087. https://doi.org/10.3389/fpsyg.2019.03087.

# **Editors and Contributors**

# **About the Editors**

**Wolfram Rollett** is a Professor of Empirical Educational Research at the University of Education Freiburg and the Freiburg Advanced Center of Education (FACE). Previously he worked as a researcher and lecturer in the field of Educational Science and Psychology at the Universities of Potsdam, Braunschweig, Dortmund, and Wuppertal. His research focuses on school development processes, the quality of extra- and co-curricular activities, educational effectiveness, and classroom composition.

**Hannah Bijlsma** is a researcher (Ph.D.) at the section of Teacher Professionalization at the University of Twente (The Netherlands) and a primary school teacher (grade 1). Her research focuses on measuring teaching quality and on the use of student perceptions of teaching quality in school contexts. In 2016 she founded a professional association for academic primary school teachers, of which she has been chairman for about five years. She is now a board member of the International Congress for School Effectiveness and Improvement (ICSEI) and a board member of the EARLI SIG on School Effectiveness and Improvement.

**Sebastian Röhl** has been an Academic Assistant in the Department of Educational Science at the University of Education Freiburg and is currently a Postdoctoral Researcher in the Institute of Education at Tübingen University (Germany). Before that he worked for more than 10 years as a grammar school teacher, school development consultant, and in teacher training. Among other areas, he conducts research in the fields of teaching development and teacher professionalization through feedback, social networks in inclusive school classes, as well as teachers' religiosity and its impact on professionalism. In addition, he is the director of an in-service professional master's study program for teaching and school development.

# **Contributors**

**Hannah Bijlsma** Section of Teacher Professionalization, University of Twente, Enschede, The Netherlands

**Perry den Brok** Department of Educational and Learning Sciences, University of Wageningen, Wageningen, The Netherlands

**Benjamin Fauth** Institute for Educational Analysis (IBBW), Stuttgart, Germany

**John W. Fleenor** Center for Creative Leadership, Greensboro, NC, USA

**Holger Gärtner** Freie Universität Berlin, Berlin, Germany

**Caterina Gawrilow** Department of Psychology, University of Tübingen, Tübingen, Germany

**Kerstin Göbel** Faculty of Educational Sciences, University of Duisburg-Essen, Essen, Germany

**Richard Göllner** Hector Research Institute of Education Sciences and Psychology, University of Tübingen, Tübingen, Germany

**Valerie Hall** University of Wolverhampton, Wolverhampton, UK

**Mari-Ana Jones** Norwegian University of Science and Technology, Trondheim, Norway

**Tim Mainhard** Department of Education, University of Utrecht, Utrecht, The Netherlands

**José Felipe Martínez** University of California, Los Angeles, CA, USA

**Katharina Neuber** Faculty of Educational Sciences, University of Duisburg-Essen, Essen, Germany

**Meike Raaflaub** University of Teacher Education, Bern, Switzerland

**Sebastian Röhl** Institute for Educational Sciences, University of Education, Freiburg, Germany

**Wolfram Rollett** Institute for Educational Sciences, University of Education, Freiburg, Germany

**Jan-Erik Schmidt** Center for School-Quality and Teacher Education, Tübingen, Germany

**Jonathan D. Schweig** RAND Corporation, Arlington, VA, USA

**Bob Uttl** Mount Royal University, Calgary, AB, Canada

**Rikkert van der Lans** LUMC-Curium - Child and Adolescent Psychiatry, Leiden, The Netherlands

**Wolfgang Wagner** Hector Research Institute of Education Sciences and Psychology, University of Tübingen, Tübingen, Germany

**Benedikt Wisniewski** Faculty of Philosophy and Social Sciences, University of Augsburg, Augsburg, Germany

**Corinne Wyss** FHNW School of Education, Brugg, Switzerland

**Klaus Zierer** Faculty of Philosophy and Social Sciences, University of Augsburg, Augsburg, Germany

# **Chapter 1 The Process Model of Student Feedback on Teaching (SFT): A Theoretical Framework and Introductory Remarks**

**Sebastian Röhl, Hannah Bijlsma, and Wolfram Rollett**

**Abstract** Student feedback on teaching in schools, conceptualized as information on student perceptions of teaching, is described by many scholars as an effective instrument for the development of teaching and teachers. Beyond that, various studies show that the productive use of this method is a very complex process in which a variety of aspects must be considered. As an introduction to this volume, this chapter presents a model based on findings from different areas of feedback and school research, called the *Process Model of Student Feedback on Teaching* (SFT). This model follows the steps of the student feedback process, starting with student perceptions of teaching, which must be professionally collected or measured. Subsequently, the teacher perceives and interprets this feedback information, which is linked to cognitive and affective reactions and processes. This can lead to an enhancement of teachers' knowledge about their own teaching and to the initiation of improvement-oriented actions, finally resulting in improved teaching and development of the teachers' professional competence. Throughout, characteristics of the organization, the students and classes, as well as the teachers need to be considered. This model serves as a framework for the subsequent overview of the contributions in this volume.

**Keywords** Student feedback · Process model · Student perceptions of teaching quality · Teacher development

S. Röhl · W. Rollett
University of Education, Freiburg, Germany
e-mail: sebastian.roehl@ph-freiburg.de

W. Rollett
e-mail: wolfram.rollett@ph-freiburg.de

H. Bijlsma
Section of Teacher Professionalization, University of Twente, Enschede, The Netherlands
e-mail: h.j.e.bijlsma@utwente.nl

# **1 Student Feedback in Schools**

Student learning processes are influenced by many different factors, including student, home, school, peer, headteacher, and teacher effects (Hattie, 2009). In schools, teachers are considered to be the most malleable within-school influence on student learning (Haertel, 2013; Nye et al., 2004), because the teacher determines the events in the classroom to a large extent. In order to work and develop toward their full potential, teachers need to receive information on the quality of their teaching. Teachers can gain some insight into this through the results of monitoring learning success and through their own class observations. Such information proves to be particularly helpful when it comes from outside sources. This "information provided by an agent (…) regarding aspects of one's performance" (Hattie & Timperley, 2007, p. 81) is typically labeled *feedback*. If feedback on teaching is considered as valuable information about teachers' performance and used accordingly, it can have positive effects on the professional development of teachers, the quality of teaching, and student learning (Garet et al., 2017). Ideally, therefore, there would be enough time and available methods to provide teachers with constructive feedback about their teaching in order to improve its quality and, as a follow-up, to positively affect the learning processes of their students.

However, in education, often little energy is invested in constructive feedback to teachers about the quality of their teaching (Frase & Streshly, 1994; Voerman et al., 2012). At the same time, classroom observations by an external observer are quite common in many school systems (Darling-Hammond, 2013). Unfortunately, to obtain a truly reliable picture of teaching quality, it is necessary to rate lessons several times, and these observations should be made by several trained observers (Praetorius et al., 2014). This makes the use of classroom observations time-consuming and expensive. On the other hand, using teachers' self-assessments of their lessons might result in invalid data, because it is questionable whether teachers are able to judge their own lessons—as they see their teaching only from their own perspective (Kruger & Dunning, 1999; Visscher, 2017)—and such self-assessments can hardly be regarded as "information provided by an agent".

Another way to provide teachers with feedback is to use student perceptions of teaching quality (Muijs, 2006; Peterson et al., 2000). If student perceptions are used, both the number of observed lessons (in cases where students assess one teacher's teaching over several lessons) and the number of observers (the students) are larger than in the case of lesson observations by external persons, which could thus improve the reliability of the feedback scores (Fauth et al., 2014). In addition, student perceptions reflect the perspective of the target group (Kane & Staiger, 2012; Quaglia & Corso, 2014; Staiger, 2012).
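The reliability gain from aggregating over many raters can be quantified with the classical Spearman–Brown prophecy formula. The chapter does not present this formula; the following is a minimal sketch with invented reliability values:

```python
def spearman_brown(r_single: float, n_raters: int) -> float:
    """Reliability of the mean of n_raters parallel ratings,
    given the reliability r_single of a single rating (Spearman-Brown)."""
    return n_raters * r_single / (1 + (n_raters - 1) * r_single)

# Hypothetical values: one single rating with reliability 0.30
# versus the mean rating of a class of 25 students.
print(round(spearman_brown(0.30, 1), 2))   # 0.30
print(round(spearman_brown(0.30, 25), 2))  # 0.91
```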

Although there are concerns about the validity and reliability of student perceptions of teaching quality (e.g., the extent to which students are able to discriminate between the different facets of teaching: de Jong & Westerhof, 2001; Fauth et al., 2014; Ferguson, 2012; Kunter & Baumert, 2006), recent studies have shown that student perceptions of teaching quality can provide reliable and valid information both for research purposes and as feedback to teachers for formative evaluation of the quality of their teaching (Burniske & Meibaum, 2012; Ferguson & Danielson, 2014; Kane et al., 2013; Kyriakides, 2005; Peterson et al., 2000).

# **2 Using Student Perceptions of Teaching for the Development of Teaching and Teachers—The Process Model of Student Feedback on Teaching (SFT)**

The basic idea of using student feedback for the development of teaching is to give teachers a comprehensive view of their teaching from the students' perspective, which might result in valuable information or data for teachers about the quality of their teaching. Based on the feedback, they can carry out improvement-oriented actions which might enhance their lessons. This, in turn, could result in more positive perceptions of the teaching by the students and improved learning processes for those students. The first experiments with this form of developing teaching were already being done in the USA in the 1920s (Remmers, 1927). Moreover, the underlying simplistic model of this approach is also commonly used in the data-based decision making research field (Lai et al., 2014; Poortman & Schildkamp, 2016; Schildkamp, 2019; van Geel et al., 2016), where it is stated that the use of teaching-related data, such as the evaluations of students' learning processes, can help to improve teaching and students' learning outcomes. In addition, the process of obtaining feedback from students and the associated student–teacher communication is an educational process, which can promote skills such as giving and receiving feedback, discussability, and dealing with criticism and differing points of view (e.g., Bastian, 2010; Zierer & Wisniewski, 2019). Student feedback is also seen as a way of promoting *student voice* (Cook-Sather, 2002, 2007): the voice of students in their own education (Lincoln, 1995). It seems important for such a process to focus on the formative use of student ratings. A summative use of student ratings in schools for accountability purposes could hinder such effects, as teachers would then need to justify their teaching. Notably, the use of student feedback for developmental purposes in schools seems to be almost exclusive to Western countries. A systematic literature review on this topic identified studies from Europe, the USA, Australia, and Turkey only, although validated student perception questionnaires for assessing teaching quality also exist for the Asian region, for example (see Chap. 9 of this volume).

Regarding the practical implementation of student feedback, it becomes apparent that the process of gathering and evaluating feedback is not quite that simple. For example, it is necessary both to overcome routines like basing decisions on intuition and instinct (Schildkamp & Kuiper, 2010) and for teachers to be data literate in order to use data systematically for the improvement of their lessons (Kippers et al., 2018; Mandinach & Gummer, 2016). Furthermore, the process of utilizing student perceptions of teaching quality as data to improve teaching is complex, as it is influenced by teacher, student, and class characteristics, and occurs within an organizational context (Schildkamp, 2019). Moreover, the success of this process can depend on many situational factors, such as questionnaire characteristics, personal reactions when evaluating the feedback, or the choice of improvement-oriented actions based on the feedback. In this introductory chapter, we therefore develop, step by step, a more complex model of the use of student feedback for developing teaching and teachers which, among other things, includes these factors. The model is visualized in Fig. 1.

**Fig. 1** Process model of student feedback on teaching (SFT, *Source* Own)

The process starts with the students, who perceive the teaching in class. These perceptions can be captured via a student feedback survey. For this purpose, questionnaires are often used, which include items to be rated or open-ended questions. Student perceptions as well as their teaching quality ratings might be influenced by student and class characteristics (Bijlsma et al., 2019; Fauth et al., 2020; Levy et al., 1992).

Once the feedback is collected (that is, when the information or data are available), it must be understood and interpreted by the teacher. Following this, a cognitive process takes place, which is often described as *reflection* in the context of teacher training and professional development (e.g., Beauchamp, 2006; Korthagen & Wubbels, 1995). Reflection should lead to better understanding of one's own teaching, and subsequently to better teaching practice (Driessen et al., 2008; Ertmer & Newby, 1996).

Research on feedback from organizational psychology could provide important insights with regard to the process of reflection on feedback. According to Ilgen et al. (1979), the processing of received performance feedback follows several steps. First, the feedback message is perceived by the person receiving the feedback, in which the accuracy and intensity of the perception play an important role. Then a decision is made about the extent to which the perceived feedback message is accepted, i.e., whether the received information is considered to be truthful. As a result, the desire or intention to respond to the feedback can arise, followed by the setting of goals in this regard (intended response) and the implementation of the intended response in practice (see also Kinicki et al., 2004).

A comparable process model is also found in Smither et al. (2005), with the authors proposing the following steps: initial reactions, goal-setting and related actions, taking action and subsequent performance improvement. As an extension of these sequence models, Kahmann and Mulder (2011) included not only cognitive reactions to feedback, but also affective reactions by the person receiving the feedback, which can both eventually result in behavioral effects.

For our theory of action, we combined these sequence models considering the context of teachers as recipients of student feedback. Therefore, we view the perception and interpretation of feedback not only from cognitive perspectives, but also from affective ones.

Regarding these *emotional effects*, student feedback can evoke positive emotions such as satisfaction and joy, or negative ones such as dissatisfaction or defensiveness. These are primarily influenced by the actual as well as the expected positivity or negativity of the feedback. *Knowledge effects* can occur when feedback provides the teacher with new information about the students' view of his or her teaching, or when the feedback reinforces the teacher's existing knowledge. Then a comparison between one's own perceptions and standards for teaching takes place. Discrepancies that emerge must be *accepted* in order for the teacher to consider changes in their teaching. Feedback data that differ strongly from one's own objectives and standards concerning one's own teaching can, together with negative emotional reactions, lead to *rejection* of the feedback (Kahmann & Mulder, 2011; Kluger & DeNisi, 1996). Furthermore, it should be noted that a discrepancy between the actual state and the target state can also lead to abandonment or modification of the previously set objectives or standards in order to avoid or reduce any further effort (Kluger & DeNisi, 1996).

After the perception and acceptance of a possible area of improvement, *goals* for the elimination of a discrepancy can be set, followed by *planning* for the implementation of the intended response (Smither et al., 2005). Subsequently, as *behavioral effects*, *improvement-oriented actions* can take place, such as adapting teaching to the different needs of students in the class (Gaertner, 2014), increased attention to specific aspects during teaching (Röhl & Rollett, 2021), discussions with students about the feedback for collaborative improvement (Gaertner, 2014), or participation in special training courses (Balch, 2012). If the actions have the desired effect on teaching practices, this might result in higher ratings from students in subsequent feedback surveys and/or better learning outcomes.

We consider the presented process model to be a promising tool for structuring research and research questions on student feedback on teaching in schools. The model combines existing research from different fields and covers what is known about the developmental process of teaching and teachers based on student feedback. We acknowledge that the model is an idealized presentation; in real school settings, influencing factors such as student, class, teacher, and organizational aspects need to be considered. Therefore, the present volume covers a variety of topics linked to these influencing factors. In what follows, we use this model to arrange the contributions and perspectives of this volume and to link them to this process.

# **3 Overview of the Volume**

In Part One of the volume, student perceptions of teaching quality and their validity and reliability are discussed by considering several theoretical and psychometric issues. These topics address theoretical and research questions pertaining to the beginning of the cycle of the Process Model of Student Feedback on Teaching (SFT) just introduced. In Chap. 2, Bijlsma et al. introduce the measurement of student perceptions from three psychometric perspectives which dominate contemporary research on teaching quality. They aim to connect psychometric theories and the different perspectives on what (measured) student perceptions are seen to be, as well as the different perspectives regarding how and for what purposes student perceptions should be used. In Chap. 3, Röhl and Rollett—in line with the Process Model of Student Feedback on Teaching (SFT)—discuss theoretically assumed teaching quality dimensions, which can be distinguished in student feedback surveys. Findings on the importance of teachers' communion with students (warmth or cooperation) as a potentially biasing factor in student ratings of instructional quality are also discussed. For Chap. 4, Bijlsma conducted a systematic review of the psychometric quality of student perception questionnaires (SPQs). She presents detailed overviews with general information about the SPQs, the results of the evaluation, and the constructs measured by the SPQs. In Chap. 5, van der Lans focuses on evidence showing that student questionnaires and classroom observation instruments can provide reliable feedback to teachers. He provides empirical evidence indicating that feedback from classroom observations and student questionnaires can be calibrated on the same continuum of instructional effectiveness, and he moves on to discuss implications for theory, future research, and practice. In Chap. 6, Schweig and Martínez present an overview of literature from different fields which examines consensus in different measures of teaching quality. They consider these alongside key assumptions and consequences of those measurement models and analytic methods which are commonly used to summarize student survey reports of teaching quality. In Chap. 7, Göllner et al. continue with further findings on the particularities of student ratings of instructional quality, pointing out the importance of considering how exactly the referent and the addressee are noted in survey items and presenting related perspectives for future in-depth research approaches.

Part Two of the book focuses on the use of student feedback for the development of teaching and teachers. Following the SFT model, we arrive here at the interpretation, reflection, and teacher improvement elements. In Chap. 8, Wisniewski and Zierer start with an overview of the functions of, and success conditions for, student feedback in the development of teaching and teachers. They point out why feedback is important for the professional development of teachers in general, and discuss three basic functions of student feedback in schools. This is followed, in Chap. 9, by Röhl's contribution, in which the first meta-analysis of the effects of student feedback on teaching quality in secondary schools is presented, providing insights into its effectiveness and potential moderating variables. In Chap. 10, Röhl and Gärtner systematize relevant factors influencing the utilization of student feedback by teachers into three domains: personal characteristics of the feedback recipients (teachers), characteristics of the organization (school), and characteristics of the feedback information (data). The two chapters which follow discuss student feedback from a more practical point of view. Göbel et al. (Chap. 11) focus on the use of student feedback to improve teaching quality during practical phases in teacher education. The authors discuss challenges and opportunities for the use of student feedback as an instrument for reflection on teaching and professional development for pre-service teachers. Schmidt and Gawrilow (Chap. 12) describe the theoretical parameters of reciprocal student–teacher feedback on cooperation between students and teachers, and outline results of an empirical study on the effects of the reciprocal method on the perceived quality of cooperation and on teacher health.

In the next part, three chapters of the volume provide supplementary perspectives on the use of student feedback for developing teaching and teachers, relating to the final part of the feedback cycle of the SFT model. Jones and Hall (Chap. 13) shed light on the critical pragmatism perspective, focusing on how student feedback can facilitate dialogue and thus contribute to the development of schools as democratic communities. The multisource feedback perspective in organizations, and the transferability of this perspective to student-to-teacher feedback in schools, is discussed by Fleenor (Chap. 14). In Chap. 15, Uttl reviews the lessons to be learned from research on student evaluation of teaching in higher education, providing insights to be taken up in research on student feedback on teaching in schools.

Finally, in the concluding chapter of the book, Rollett et al. summarize the findings and conclusions drawn from the chapters in this volume and discuss the directions forward for researchers, policy makers, and schools.

**Acknowledgments** The editors would like to thank the Dutch Research Council (NWO; project number 36.201.009) and the University of Education Freiburg (Germany) for funding the open access publication of this volume.

# **References**


**Sebastian Röhl** has been an Academic Assistant in the Department of Educational Science at the University of Education Freiburg and is currently a Postdoctoral Researcher in the Institute of Education at Tübingen University (Germany). Before that he worked for more than 10 years as a grammar school teacher, school development consultant, and in teacher training. Among other areas, he conducts research in the fields of teaching development and teacher professionalization through feedback, social networks in inclusive school classes, as well as teachers' religiosity and its impact on professionalism. In addition, he is the Director of an in-service professional master's study program for teaching and school development.

**Hannah Bijlsma** is a researcher (Ph.D.) at the section of Teacher Professionalization at the University of Twente (the Netherlands) and a primary school teacher (grade 1). Her research focuses on measuring teaching quality and on the use of student perceptions of teaching quality in school contexts. In 2016 she founded a professional association for academic primary school teachers, of which she has been chairman for about five years. She is now a board member of the International Congress for School Effectiveness and Improvement (ICSEI) and a board member of the EARLI SIG on School Effectiveness and Improvement.

**Wolfram Rollett** is a Professor of Empirical Educational Research at the University of Education Freiburg and the Freiburg Advanced Center of Education (FACE). Previously he worked as a researcher and lecturer in the field of Educational Science and Psychology at the Universities of Potsdam, Braunschweig, Dortmund, and Wuppertal. His research focuses on school development processes, the quality of extra- and co-curricular activities, educational effectiveness, and classroom composition.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the Chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the Chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Part I Measuring Student Perceptions of Teaching: Reliability, Validity, and Theoretical Considerations**

# **Chapter 2 A Reflection on Student Perceptions of Teaching Quality from Three Psychometric Perspectives: CCT, IRT and GT**

**Hannah Bijlsma, Rikkert van der Lans, Tim Mainhard, and Perry den Brok**

**Abstract** This chapter discusses student perceptions in terms of three psychometric perspectives that dominate contemporary research on teaching quality, namely, Classical Test Theory (CTT), Item Response Theory (IRT) and Generalizability Theory (GT). These perspectives function as exemplars of the connection between psychometric theories and the different perspectives on "what a perception is", as well as on how and for what purposes student perceptions should be used. The main message of the chapter is that the choice of a psychometric theory is not merely a technical matter, but also has implications for how the nature of perceptions is conceptualized. After presenting and linking each psychometric theory, their strengths and weaknesses in the context of student perceptions of teaching quality and issues of practical implementation are discussed.

**Keywords** Student perceptions · Teaching quality · Classical test theory · Item response theory · Generalizability theory

H. Bijlsma
Section of Teacher Professionalization, University of Twente, Enschede, The Netherlands
e-mail: h.j.e.bijlsma@utwente.nl

R. van der Lans
Curium LUMC, Leiden, The Netherlands
e-mail: r.m.van_der_lans@curium.nl

T. Mainhard
Department of Education, University of Utrecht, Utrecht, The Netherlands
e-mail: m.t.mainhard@uu.nl

P. den Brok
Department of Educational and Learning Sciences, University of Wageningen, Wageningen, The Netherlands
e-mail: perry.denbrok@wur.nl

# **1 Introduction**

Student perceptions of teachers and their behaviours have become an important way to capture what happens in class. Questionnaires that map student perceptions of teaching quality are used, for example, to measure the effectiveness of educational interventions (Burniske & Meibaum, 2012; Kyriakides, 2005). In schools, student perceptions are collected by teachers to obtain feedback for improvement and professional development activities (Bijlsma et al., 2019).

Using student perceptions of teaching quality is a complex process. Typically, perceptions are collected using a standardized questionnaire instrument. When a student selects a response category of an item like "my teacher explains everything clearly to me", however, many processes may affect the student's answer. For example, a student may deliberately give a higher rating for the item than their real estimation of their teacher's skill at explanation because (s)he wants to present him/herself in a socially desirable way, or the student's perception may be biased by stereotypical impressions. Alternatively, the student might be honest and their perception unbiased, but a misinterpretation of the item content, for example, a different interpretation of what clarity means in this context, may still affect the item response (Maulana & Helms-Lorenz, 2016).

Moreover, items can be formulated according to the level of behaviour at which they are directed (at an individual student or the whole class) and in terms of the level of perception (personal, class). In Chap. 7 by Göllner et al. in this volume, this is referred to as differences in the referent and in the addressee of items. For example, the aforementioned item can be worded as: "This teacher explains things clearly to us/the class" (class perception, behaviour to class), "This teacher explains things clearly to me" (class perception, behaviour to individual), "I find this teacher to explain things clearly" (personal perception, behaviour to class) and "I find this teacher to explain things clearly to me" (personal perception, behaviour to individual). While this may seem trivial, it has consequences for the expected sources of variation in perceptions: items asking about class perceptions or behaviours directed at the whole class are more likely to evoke variation in shared sources of perceptions, while items asking about behaviours directed at individuals or personal perceptions are more likely to evoke variation in idiosyncratic sources of perceptions.

The question of what we actually measure, therefore, has no uniform answer. By completing standardized questionnaires, students give responses to many items and psychometric models are applied to combine the item ratings into an overall student perception of teachers' teaching (students' responses are then combined to a numerical value or score). This overall score—not the item ratings—is usually fed back to teachers or is used for research purposes. This approach of combining and integrating ratings into one overall perception score suggests that students cognitively process observations of teaching behaviours similarly and in such a general and integrated way. From this perspective, the psychometric models that connect and integrate the item ratings attempt to reconstruct students' mental representations of the teachers' teaching.

This chapter discusses student perceptions in terms of three psychometric perspectives that dominate contemporary research on teaching quality, namely, Classical Test Theory (CTT), Item Response Theory (IRT) and Generalizability Theory (GT). CTT (part 2) is based on the assumption that there is one true score plus random error. The true score is then an average of all students' ratings on the items that form a dimension or factor. In IRT (part 3), more emphasis is put on how the items relate to each other and what dimensions can be distinguished in the instrument used to collect student perceptions of teaching quality. The potential of GT (part 4) lies in the fact that it tries to disentangle the variability in student ratings beyond a "true score" and error, bringing in aspects such as personal characteristics and dyadic relationships between people. The chapter discusses these psychometric perspectives separately, but there are also integrated approaches that enable researchers to estimate combinations of the models (Chalmers, 2012; Robitzsch et al., 2020). The connection of CTT, IRT and GT with latent variable models becomes evident when one realizes that all three specify a relationship between the teacher's latent ability level and the student responses stimulated (or elicited) by the items (e.g., Chalmers, 2012; de Boeck et al., 2011; Rizopoulos, 2006; Robitzsch et al., 2020).

The main message of the chapter is that the choice of a psychometric model is not merely a technical matter, but also has implications for how the nature of perceptions is conceptualized. Finally, we acknowledge that the construct of teaching quality is highly contested and consensus about its conceptualization or definition is minimal (Cohen & Goldhaber, 2016). We do not present a definition of teaching quality in this chapter. By leaving the definition completely open, we intend to maximize our flexibility to discuss various possibilities offered by the three psychometric theories. After presenting and linking each psychometric theory, we will discuss their strengths and weaknesses in the context of student perceptions of teaching quality.

# **2 Classical Test Theory**

# *2.1 The CTT Model*

According to Classical Test Theory (CTT), student perceptions of teaching quality reflect the teacher's actual teaching quality plus random error variance (e.g., Brennan, 2001; Lord & Novick, 1968; Sijtsma, 2016; Spearman, 1905). The teacher's actual teaching quality is captured by the so-called "true score", which is statistically defined as the mean score over all item responses about that teacher. The error variance consists of all random deviations from the teacher's mean score (Novick, 1966). Furthermore, the CTT model states that all items are equally associated with the broader perceptual representation of the teacher's teaching (i.e., items are supposed to have similar factor loadings).
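In symbols (our notation; the chapter states this only verbally), the CTT decomposition and the resulting true-score estimate can be sketched as:

```latex
% X_{si}: rating of student s on item i about one teacher,
% T: the teacher's true score, E_{si}: zero-mean random error.
X_{si} = T + E_{si},
\qquad
\hat{T} = \frac{1}{S\,I} \sum_{s=1}^{S} \sum_{i=1}^{I} X_{si}
```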


**Table 1** Possible example of feedback form results for one teacher teaching a class of 25 students

Marsh (2007) noted that overall questionnaire outcomes may be uninformative about *specific* teaching behaviours, and therefore recommends structuring questionnaires according to different factors. Factors cluster items that seem to have something in common based on the inter-item correlations. For example, the items, "My teacher explains everything clearly to me" and "My teacher uses clear examples" (see Table 1), are connected to the same factor, which clusters items related to the clarity and structuredness of explanations (Maulana & Helms-Lorenz, 2016). Reporting the class mean for items related to the clarity and structuredness of explanations is considered more informative than just an overall mean for all items.

In educational contexts, the CTT model is usually extended by including multiple nested levels of random error; for example, students are nested within teachers. The key idea of CTT, however, remains, in that only the mean of a factor is informative and variation around the mean is uninformative noise.

Paramount to the logic behind CTT is that item ratings related to the same teacher should show minimal variability *and* that item ratings related to different teachers should show large(r) variation. Hence, item ratings assigned by one student to the same teacher are expected to vary minimally. The mean questionnaire scores from students within the same class are also expected to show minimal variability. These expectations are routinely examined with estimates of internal consistency (Cronbach, 1951) and intra-class correlations (ICCs; Lüdtke et al., 2009). Internal consistency is sensitive to items whose ratings vary greatly compared with the other items' ratings. The ICC estimates the variance in mean questionnaire scores of students in different classes as a proportion of the variance of all ratings.
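As a runnable sketch of these two routine checks, the following uses the textbook formulas for Cronbach's alpha and a one-way ICC(1); the data layout (a students-by-items matrix per class) and all numbers are hypothetical, not taken from the chapter:

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Internal consistency; ratings is a (students x items) matrix."""
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def icc1(class_scores: list) -> float:
    """One-way random-effects ICC(1) from per-class arrays of student
    questionnaire scores (ANOVA mean-squares estimator)."""
    n_classes = len(class_scores)
    n_total = sum(len(c) for c in class_scores)
    k = n_total / n_classes                      # average class size
    grand = np.concatenate(class_scores).mean()
    ss_between = sum(len(c) * (c.mean() - grand) ** 2 for c in class_scores)
    ss_within = sum(((c - c.mean()) ** 2).sum() for c in class_scores)
    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (n_total - n_classes)
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(42)
ratings = rng.integers(1, 5, size=(25, 5)).astype(float)  # one class
print(round(cronbach_alpha(ratings), 2))  # random data, so alpha is low here

classes = [rng.normal(loc=m, scale=0.5, size=25) for m in (2.4, 2.9, 3.3)]
print(round(icc1(classes), 2))  # between-class differences raise the ICC
```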

# *2.2 An Example of CTT in Practice*

Suppose that 25 students in a class respond to the item "My teacher explains everything clearly to me" by choosing one of the four answer options: 1 = "never", 2 = "seldom", 3 = "occasionally" and 4 = "often" (Table 1). If CTT is applied strictly, then the mean class perception (2.72) is the only reliable and, thus, the only informative parameter for the teacher to consider, and individual deviations are *random* noise. This logic can easily be generalized to a broader set of items. For example, the mean of the student questionnaire ratings can be computed and CTT can be applied to these mean scores, which may then be argued to be the most reliable estimate of the teacher's actual teaching quality. In this example, according to CTT, 2.81 reflects the teacher's teaching quality based on these five items.
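A runnable version of this logic (the 25 ratings below are invented so as to reproduce the class mean of 2.72 reported in the text; Table 1 itself is not reproduced here):

```python
import numpy as np

# Hypothetical responses of 25 students to "My teacher explains
# everything clearly to me" (1=never, 2=seldom, 3=occasionally, 4=often).
item = np.array([3, 2, 3, 4, 2, 3, 3, 2, 4, 3,
                 2, 3, 3, 2, 3, 4, 2, 3, 3, 2,
                 3, 2, 3, 2, 2])
print(item.mean())  # 2.72 - under strict CTT the only informative value

# Generalizing to all five items: the grand mean over the students-by-items
# matrix would be read as the teacher's estimated teaching quality (2.81).
```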

# *2.3 Advantages and Limitations of the CTT Approach*

The CTT approach, and Marsh's (1987, 2007) work in particular, are well-known and studied in the educational sciences. Estimates of internal consistency and ICCs have proven to be stable across different questionnaires (cf. Marsh, 2007; van der Lans & Maulana, 2018). These statistics are also intuitively understandable for many practitioners and the application of CTT requires only a modest level of mathematical and statistical skill, which is not unimportant.

However, the use of CTT reflects high trust in the students as honest and accurate perceivers. To illustrate this, suppose that students deliberately manipulate their ratings upwards because they like the teacher; then clearly such *systematic* bias or manipulation remains undetected by measures such as internal consistency and the ICC, which quantify *random* error variance only (den Brok & Smart, 2007). In general, CTT provides very limited means to empirically investigate systematic biases in perceptions. Second, diagnosing poor item quality by the comparatively large variance in ratings, as is done by internal consistency measures, is only valid if one believes that the ratings of all items must be biased by the same amount of (random) error. Suppose again that students deliberately manipulate their ratings upwards because they like the teacher; then their manipulation might well be expressed most in items referring to specific teacher traits that are likable (such as "humour" or "showing respect"). More generally, CTT fails to make (differentiated) predictions about the response process; for example, when students check a response category, it remains unresolved what latent cognitive representation of the teacher's teaching the students had in mind.
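The first limitation can be demonstrated in a few lines: a uniform upward manipulation of all ratings leaves internal consistency untouched while shifting the estimated "true score". A sketch with invented data:

```python
import numpy as np

def alpha(r: np.ndarray) -> float:
    """Cronbach's alpha for a (students x items) rating matrix."""
    k = r.shape[1]
    return k / (k - 1) * (1 - r.var(0, ddof=1).sum() / r.sum(1).var(ddof=1))

rng = np.random.default_rng(7)
honest = rng.integers(1, 4, size=(25, 5)).astype(float)  # ratings 1..3
liked = honest + 1.0  # every student inflates by one point (stays <= 4)

# The constant shift changes no variance or covariance, so alpha
# (a random-error statistic) cannot detect the systematic bias ...
print(round(alpha(honest), 3), round(alpha(liked), 3))
# ... yet the CTT estimate of the teacher's quality moves up a full point.
print(honest.mean(), liked.mean())
```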

# **3 Item Response Theory**

# *3.1 Item Response Theory (IRT) Model(s)*

According to IRT, student perceptions of teaching are ordered on a latent continuum (Bond & Fox, 2007; Embretson & Reise, 2013). With IRT, researchers estimate the teacher's position on this latent continuum and this position is then used to predict the most likely teacher behaviour that students will have experienced from this teacher. There are two levels at which IRT can be used to make predictions about what teacher behaviours students likely will have experienced: (1) the level of the item and (2) the level of the construct (Bond & Fox, 2007; Embretson & Reise, 2013). At the level of the item, IRT uses the response categories to make predictions about whether students experienced that particular behaviour seldom, occasionally or often. At the level of the construct, IRT makes predictions about how items jointly represent the teachers' teaching.

We will explain this by using one of the five items from Table 1 ("My teacher explains everything clearly to me" [explains clearly]). In Fig. 1, the y-axis indicates the probability of checking the higher response category out of two competing response categories and the x-axis indicates the level of teaching quality (θ). Teachers with a level of teaching quality located at the position of the arrow have a high probability of receiving a response "≥ seldom" on explains everything clearly to me, but a low probability of receiving a response "occasionally". The probability that students check the higher response category increases only when the teacher—according to the responding student—has achieved the conditions set by the higher response category for the item.

The item response process can be used to predict the most likely frequency with which the behaviour is observed (or the most likely impact, if the item labels are insufficient, sufficient, excellent). This item response process is part of a wider process here referred to as the construct response process. The construct response process predicts how students weigh and position items relative to other items.

**Fig. 1** Visualization of the item and construct response process

In IRT, one well-known construct response process is the Guttman scale or simplex<sup>1</sup> (Guttman, 1954; Jöreskog, 1978). In the simplex, item positions depend on their "difficulty". Some items are much more likely to receive the rating "never" (called "difficult" items), while other items are much more likely to be rated as "often" (the "easy" items). Figure 2 visualizes this pattern using five items. In Fig. 2, the checkmarks indicate a high probability that students perceive the teacher to perform the behaviour described by the item often. Hence, student D is predicted to perceive the teacher as performing the first four behaviours often, but not the fifth. Item 1 would be a "difficult" item, and 5 would be an "easy" item.

<sup>1</sup> There are two other main classes of construct response processes, namely the Coombs/unfolding and the circumplex (Browne, 1992; de Leeuw & Mair, 2011; Mokken et al., 2001). It goes beyond the scope of this chapter to define and describe these as well. Hence, we focus here on the Guttman scale/simplex construct response.
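To make the simplex logic concrete, here is a minimal dichotomous-Rasch sketch (behaviour reported "often" vs. not), with invented item difficulties; it illustrates the checkmark pattern just described, not any calibration from the chapter:

```python
import math

def p_often(theta: float, b: float) -> float:
    """Rasch probability that a student reports the behaviour 'often'
    for a teacher at position theta and an item with difficulty b."""
    return 1 / (1 + math.exp(-(theta - b)))

# Hypothetical difficulties, ordered from an "easy" to a "difficult" item.
difficulties = [("behaviour A (easy)", -1.5), ("behaviour B", -0.5),
                ("behaviour C", 0.3), ("behaviour D", 1.0),
                ("behaviour E (difficult)", 1.8)]

teacher_theta = 0.6
for name, b in difficulties:
    mark = "X" if p_often(teacher_theta, b) > 0.5 else " "
    print(f"[{mark}] {name}: {p_often(teacher_theta, b):.2f}")
# Items with difficulty below theta get a checkmark (predicted 'often'),
# items above theta do not - the step-like pattern sketched in Fig. 2.
```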

To order items, IRT models include a location parameter (sometimes referred to as item difficulty). The location parameter predicts when the item response process changes within the wider construct response process. For example, the response process for item four is predicted to change if the first three items have received high ratings. Other item parameters that can be estimated by IRT models are the discrimination parameter (to predict and correct for systematic deviations from the predicted item response process), and a guessing parameter (to predict and correct for randomness in the item response process). In what follows, we will present an example of research applying IRT to student perceptions to illustrate the above.
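Before that example, the three parameters just listed can be written out. In standard IRT notation (ours, not the chapter's), location, discrimination, and guessing enter the three-parameter logistic (3PL) item response function; setting the discrimination to 1 and the guessing parameter to 0 recovers the Rasch model used in the research described next:

```latex
% 3PL item response function (a sketch in standard notation):
% b_i = location/difficulty, a_i = discrimination, c_i = guessing.
P(X_{si} = 1 \mid \theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
```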

# *3.2 IRT in Research on Student Perceptions*

Van de Grift and Kyriakides independently began implementing IRT in the context of teaching quality with student perception data (for details, see Antoniou & Kyriakides, 2013; Kyriakides et al., 2018; Maulana et al., 2015; van de Grift et al., 2011, 2014; van der Lans et al., 2015). Their models hypothesize that teaching effectiveness develops along a latent continuum in which learning to teach starts with learning less complex teaching behaviours (e.g., ensuring a safe classroom climate) and ends with learning more complex teaching behaviours (e.g., having students cooperate with classmates). Hence, students' ratings on questionnaires that list various teaching behaviours should indicate that some more complex teaching behaviours are perceived to be performed successfully less frequently, whereas other less complex teaching behaviours are perceived to be performed successfully more often (by more teachers). These researchers have applied Rasch-family models<sup>2</sup>—a specific type of IRT model—to test sequences of item complexity and to locate teachers on the latent continuum. Once they have located the teacher, they provide the teacher with feedback indicating the next steps for improvement (i.e., the items located just beyond the teacher's position; a minimal sketch of this logic follows below). In other recent research, IRT has been used to examine issues of validity of student perception data (e.g., Bijlsma et al., submitted; van der Scheer et al., 2018).

<sup>1</sup> There are two other main classes of construct response processes, namely the Coombs/unfolding and the circumplex (Browne, 1992; de Leeuw & Mair, 2011; Mokken et al., 2001). It goes beyond the scope of this chapter to define and describe these as well; hence, we focus here on the Guttman scale/simplex construct response process.
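As an illustration of this feedback logic, the following minimal sketch locates a teacher on the continuum from dichotomized student ratings (1 = behaviour perceived as performed often) and reports the items just beyond that location as next steps. The items, their Rasch locations, and the response pattern are invented for the example and are not taken from the studies cited above.

```python
import math
from scipy.optimize import minimize_scalar

# Illustrative items ordered from less to more complex teaching behaviours,
# with assumed Rasch locations (in logits); values are made up for this sketch.
items = [
    ("safe classroom climate", -1.5),
    ("clear explanations",     -0.5),
    ("activating questions",    0.4),
    ("adaptive instruction",    1.2),
    ("student cooperation",     2.0),
]

def rasch_p(theta, b):
    """Probability of a positive rating under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses):
    """Maximum-likelihood location of the teacher given 0/1 ratings."""
    def neg_log_lik(theta):
        ll = 0.0
        for (_, b), x in zip(items, responses):
            p = rasch_p(theta, b)
            ll += x * math.log(p) + (1 - x) * math.log(1 - p)
        return -ll
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# One (aggregated) response pattern: first three behaviours observed often.
theta = estimate_theta([1, 1, 1, 0, 0])
next_steps = [name for name, b in items if b > theta]
print(f"estimated location: {theta:.2f}; next steps: {next_steps}")
```

The items located just above the estimated position are then reported back to the teacher as the most promising next steps, mirroring the feedback procedure described above.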

# *3.3 Advantages and Limitations of IRT Models*

The comprehensive framework of IRT provides various possibilities for testing hypotheses about students' response processes at the level of the item and at the level of the construct. IRT is therefore promising as a way to develop and test theories that predict how different formulations of survey items and/or response categories translate into distinct item response and construct response processes. Substantive theories can also be translated into item and construct response processes, as in the example described in the previous section.

However, a disadvantage of IRT is that it basically assumes that the item response process is unbiased. Take the research by van de Grift et al. (2014) discussed above. They predicted that student ratings follow sequences predicted by theory on teacher development, but this prediction assumes that student ratings are a direct (unbiased) numerical representation of the teacher's actual behaviour. IRT can include a discrimination parameter to correct for systematic biases, but this parameter corrects the item response process for all biases together and is generally uninformative about the potential sources of bias. Various biases can impact students' item responses, such as social desirability and stereotypical views (Kenny, 1994). As we will detail next, generalizability theory provides a framework for examining such influences on item ratings.

<sup>2</sup> Rasch-family models are applied to test the theoretical models, because Rasch model fit tests were developed to empirically examine hierarchical orderings in item ratings (Bond & Fox, 2007). Hence, if student perceptions are unbiased, then their responses could be used to locate the teacher on this latent novice–expert continuum.

# **4 Generalizability Theory**

Generalizability Theory (GT) extends Classical Test Theory (CTT) by introducing the possibility of including systematic variance components (or facets) other than error and a teacher's "true score" (Brennan, 2001). The basic idea is that what is called error in CTT can be further sub-divided into systematic facets or sources of variability (Malloy, 2018) that potentially affect student perceptions of teaching quality. When such variance components are considered nuisance parameters, GT conceptually coincides with CTT, as it is viewed by Marsh (2007), for example. Traditionally, in the educational context, GT has been used to determine the number of tasks or raters that yield reliable test results (Shavelson & Webb, 2005). As such, the amount of error that tasks introduce or the degree of consensus between raters is typically GT's main focus. Yet, the strength of GT is that it can also be used to embrace and study 'error' in an attempt to learn more about how these additional sources of variability impact perceptions of social phenomena such as teaching.

# *4.1 A Practical Example Using GT and Student Ratings*

One of the best-known models in social science that applies GT to social perceptions and interactions is Kenny's Social Relations Model (SRM; Kenny, 1994). The basic assumption of the SRM is that any rating of a social perception has, besides error, three potential sources: an actor or rater effect (i.e., due to the student who responds to an item), a partner or target effect (i.e., due to the teacher who is rated) and a relationship effect (i.e., variability introduced by the specific combination of this student rating that specific teacher). The partner or target effect resembles what is taken to be the teacher's true score or true ability in CTT. The variance in partner effects captures the degree of consensus between students on a certain aspect of teaching quality. Stable response tendencies within students are captured in the actor effect. For example, some students are quick learners and may therefore readily indicate that they understand teacher explanations, irrespective of a specific teacher's quality. There can also be systematic variance in ratings due to the relationship between, or the specific pairing of, students and teachers. Thus, on top of a student's stable tendency to think that teachers can explain things well (rater effect) and the teacher's general ability to explain things (target effect), student A may have experienced instances where teacher B has explained content exceptionally well. This shared interaction history may affect student A's ratings over and above the rater and target effects (Mainhard et al., 2018).
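In equation form, the SRM decomposes the rating that student *s* gives teacher *t* as follows (this is the standard SRM decomposition):

```latex
% Social Relations Model decomposition of a single rating
x_{st} = \mu + \alpha_s + \beta_t + \gamma_{st} + \varepsilon_{st}
```

where μ is the grand mean, α<sub>s</sub> the actor (rater) effect, β<sub>t</sub> the partner (target) effect, γ<sub>st</sub> the relationship effect, and ε<sub>st</sub> error; the variances of α, β, and γ quantify the three sources of variability described above.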

GT and the SRM can be applied at the item level, though they are more commonly applied at the construct level (Kenny, 1994, 1996; Kenny et al., 2006). Let us consider an example at the item level. Suppose that students complete the item "my teacher explains everything clearly to me". At the item level, the SRM is then informative about the target effect (do students agree that some teachers explain things well while others do not?), about the actor effect (do some students tend to experience all teachers' explanations as clear, while other students tend to perceive all teachers' explanations as hard to understand?), and about the dyadic effects (do some students experience a teacher's explanations as clear over and above their personal actor effect and that teacher's target effect?). Note that the actor and relationship variances would be considered error in CTT. The variability found in these sources can then be explained with predictors, as in regression analysis. For example, students' actor effects may be explained by their general academic ability, and teachers' target effects by years of experience. Relationship effects may occur, for example, because some teachers think that certain students require a certain kind of explanation to understand the subject matter.

# *4.2 Advantages and Limitations of Generalizability Theory*

An advantage of dealing with student ratings of teaching quality according to the GT approach is that it is a relatively simple extension of the better-known CTT. Those acquainted with multilevel analyses will find GT quite straightforward (Kenny et al., 2006). Conceptually, GT is more informative about potential variables that impact students' item responses. When items barely show stable variance between students, the responses are only minimally affected by students' personal characteristics and answer tendencies.

However, compared to IRT, GT puts little emphasis on how item ratings can be organized into a broader representation of teaching. Like CTT, GT is applied to sets of items that have a similar association with the latent construct. Further, the GT approach requires complex data sets: it cannot be applied to datasets in which each class is paired with a single teacher. Instead, students need to complete a questionnaire for several teachers, and teachers need to be rated by several classes (see Mainhard et al., 2018 for an example).
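To make this data requirement concrete, the sketch below simulates a fully crossed design (every student rates every teacher) and recovers the variance components with classical two-way ANOVA logic. The design and the numbers are invented for illustration; note that with a single rating per student–teacher pair, the relationship effect cannot be separated from measurement error, so the two are estimated jointly here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_teachers = 200, 40

# Simulate ratings: grand mean + actor + target + (relationship + error).
actor  = rng.normal(0, 0.6, n_students)                 # sd 0.6 -> var 0.36
target = rng.normal(0, 0.8, n_teachers)                 # sd 0.8 -> var 0.64
resid  = rng.normal(0, 0.5, (n_students, n_teachers))   # sd 0.5 -> var 0.25
ratings = 3.0 + actor[:, None] + target[None, :] + resid

# Method-of-moments (two-way ANOVA) estimates of the variance components.
row_var = ratings.mean(axis=1).var(ddof=1)   # actor var + resid var / n_teachers
col_var = ratings.mean(axis=0).var(ddof=1)   # target var + resid var / n_students
cell_ms = (ratings - ratings.mean(axis=1, keepdims=True)
                   - ratings.mean(axis=0, keepdims=True)
                   + ratings.mean()).var(ddof=0)        # ~ relationship + error

var_actor  = row_var - cell_ms / n_teachers
var_target = col_var - cell_ms / n_students
print(f"actor var  ~ {var_actor:.2f}  (true 0.36)")
print(f"target var ~ {var_target:.2f}  (true 0.64)")
print(f"relationship+error var ~ {cell_ms:.2f}  (true 0.25)")
```

If each student rated only one teacher, the row means and cell residuals above would be undefined, which is exactly why GT needs the crossed (or multi-class) designs described in the text.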

# **5 Discussion**

In this chapter, three dominant psychometric theories were discussed within the domain of research on the validity and reliability of student perceptions of teaching quality: Classical Test Theory, Item Response Theory and Generalizability Theory. While each of these models has its specific advantages and disadvantages, together they shed more complete light on what constitutes and determines students' perceptions of teaching quality, disentangling true scores from error, and distinguishing between more systematic and more random sources of variation in perceptions. Together, they present a nuanced and complex picture of what makes a (student) perception, and also how it can be used in research.

The main message of this chapter is that the choice of a psychometric model is not merely a technical matter; it also has implications for how the nature of perceptions is conceptualized. Statistical techniques and software are tools that can help, but they depend on a specific theory about what teaching (quality) is and about the dimensions or constructs, and their interrelationships, that underlie such behaviour. Apart from the three theories described in this chapter, many instruments measuring student perceptions are based on effectiveness research. Such instruments mainly include variables that have been found to be associated with student outcomes in correlational research, rather than specifying a structure in and between different dimensions of teaching and their likelihood of (co-)occurring (Skourdoumbis & Gale, 2013; Wrigley, 2004). For this purpose, CTT can be applied. Furthermore, many instruments are based on the frequency of occurrence of behaviours, assuming an order or a single dimension in these occurrences based on difficulty, routine or other phenomena, which links to IRT (Maulana & Helms-Lorenz, 2016; den Brok et al., 2018). However, others have argued that teaching quality is multidimensional in nature, with behaviours being interpretable from various perspectives and adding value to different outcomes at the same time (Doyle, 1986; den Brok, 2001; den Brok et al., 2004; Shuell, 1996). GT can be applied here.

One may argue that basing a theory about teaching quality on the actual presence of behaviour or on associations with existing student outcomes is conservative, and does not allow exploration of new teaching methods, new organisational forms of education or alternative learning outcomes. However, assumptions behind the occurrence of behaviours may differ depending on the perspective taken on teaching, as may their theoretical underpinnings. For example, many interactional theories assume two independent dimensions behind teaching, which order components of behaviour in circumplex structures with specific patterns and interrelations between behaviours (or items) (Fabrigar et al., 1997; Gurtman & Pincus, 2000; Wubbels et al., 2006). The more precisely theories are specified, the more easily they can be tested statistically, as many programmes assume or require specific relations to be tested when studying perceptions; consider, for example, structural equation modelling, confirmatory factor analysis, IRT analysis or latent variable analysis (den Brok et al., 2018).

# **6 Putting it all Together**

With this chapter, we hope to have provided more insight into the interesting, yet complicated, world of student perceptions of teaching quality. In conclusion, we have a few take-away messages for researchers interested in using student perceptions of teaching.

First, as mentioned above, it is important to be specific about the underlying assumptions one has about the nature of the student perceptions one is interested in. These assumptions should be grounded in prior research conducted on perceptions of the particular teaching behaviours one is interested in. For example, are the perceptions expected to vary considerably between teachers, classes or schools? Are the perceptions likely to evoke certain psychological processes, such as social desirability or stereotypical responses? Are the behaviours expected to be familiar or unfamiliar to perceivers? Depending on what is known or deemed relevant, researchers can choose between one or several of the theories mentioned in this chapter.

Second, it is important to be specific about the wording of the items capturing the perceptions, as wording may lead to differences in response patterns, and thereby to differences in the sources of variance that occur, related to the perceiver, the object, or the relation between them. Typically, researchers are not very conscious of the choices and assumptions they make about perceptions and the wording they use.

Third, it is important to conceptualize and make explicit the different dimensions or constructs one is interested in and the expected relationships between them, preferably based on theory (and empirical results). As this chapter has shown, constructs may relate to each other in terms of difficulty or chance of occurrence (as with simplex structures), but also in terms of relatedness or independence (as with circumplex structures).

When researchers take all of these reflections into account, interesting insights may be obtained by collecting student perceptions of teaching, and by comparing these with, for example, the perceptions of others, such as teachers themselves. The present chapter provides an overview of techniques and three major theories that may be used to analyse and conceptualize such perceptions.

# **References**


**Hannah Bijlsma** is a researcher (Ph.D.) at the section of Teacher Professionalization at the University of Twente (the Netherlands) and a primary school teacher (grade 1). Her research focuses on measuring teaching quality and on the use of student perceptions of teaching quality in school contexts. In 2016 she founded a professional association for academic primary school teachers, which she chaired for about five years. She is now a board member of the International Congress for School Effectiveness and Improvement (ICSEI) and a board member of the EARLI SIG on School Effectiveness and Improvement.

**Rikkert van der Lans** is a postdoctoral researcher currently working at the Department of Child and Adolescent Psychiatry, Curium-Leiden (the Netherlands). Previously he worked as a postdoctoral researcher at the department of teacher education of the University of Groningen, as a lecturer in methods and statistics at the department of educational sciences (GION) of University of Groningen, and as lecturer in the field of psychometrics at the department of methods and statistics of the University of Tilburg (the Netherlands). His research focuses on the evaluation of professional development and the psychometric assessment of quality tests.

**Tim Mainhard** is Associate Professor at the Department of Education at Utrecht University (the Netherlands). His research focuses on social dynamics in educational settings—specifically teacher-student interactions and relationships—and their impact on student and teacher outcomes, such as emotions, motivation, and academic achievement. Tim teaches classroom management courses at the Utrecht Graduate School for Teaching in both the primary and secondary teacher education programs. Tim has been chair of the Classroom Management Special Interest Group of the American Educational Research Association and is an associate editor for the journal Learning and Instruction.

**Perry den Brok** is full Professor and chair of the Education and Learning Sciences group at Wageningen University and Research (the Netherlands), and chair of the 4TU Centre for Engineering Education. His research focuses on educational innovation in higher education, teacher learning and professional development, teacher-student interpersonal relationships, and educational learning environments—both in-class as well as out-of-school learning environments. He was European editor of the Learning Environments Research journal (Springer) for well over 10 years. He has published several review articles and book chapters on teacher effectiveness and teacher collaborative learning.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 3 Student Perceptions of Teaching Quality: Dimensionality and Halo Effects**

**Sebastian Röhl and Wolfram Rollett**

**Abstract** This chapter deals with the factorial structure of survey instruments for student perception of teaching quality. Often, high intercorrelations occur between different theoretically postulated teaching quality dimensions; other analyses point to a single unified factor in student perceptions of teaching quality, seemingly reflecting a "general impression" instead of a differentiated judgment. At the same time, findings from research on social judgment processes and from classroom research indicate that the teachers' communion (warmth or cooperation) as well as students' general subject interest can be important biasing factors in the sense of halo effects in student ratings of teaching quality. After presenting an overview of studies on the dimensionality of various survey instruments, we discuss whether aggregated data is impacted by an overall "general impression". We confirmed this hypothesis using a sample of *N* = 1056 students from 50 secondary school classes. Moreover, this general impression could be explained at student and class level to a large extent by students' perception of the teacher's communion. Student general subject interest showed a medium effect but only at the individual level. These findings indicate that student perceptions of teaching quality dimensions are indeed influenced by a general impression which can be explained largely by teacher's communion.

**Keywords** Halo effect · Social judgment · Communion · Student feedback · Teaching quality

S. Röhl (B) · W. Rollett

Institute for Educational Sciences, University of Education, Freiburg, Germany e-mail: sebastian.roehl@ph-freiburg.de

W. Rollett e-mail: wolfram.rollett@ph-freiburg.de

# **1 Introduction**

An important precondition for using measurements of student perceptions of teaching is the validity of the collected data, both when used as informative feedback to teachers and when used to measure teaching quality within research studies. At the same time, measurements of student perceptions must be consistent with the theoretical assumptions the survey instrument makes about professional competence or quality characteristics. While other chapters of this volume deal with perspective-specific characteristics of different feedback sources (Chap. 7 by Göllner et al. and Chap. 5 by van der Lans in this volume) or predictive validity (Chap. 6 by Schweig and Martinez in this volume), this chapter focuses on the extent to which students can actually distinguish between the different theoretically postulated dimensions when assessing individual aspects of teaching. Subsequently, we examine whether a limited ability to differentiate can be explained by an overlaying affective attitude toward the teaching or the teacher, in the sense of a halo bias.

# *1.1 Dimensionality of Student Ratings on Teaching Quality*

Usually, questionnaires are used to collect student perceptions of teaching, in which a certain number of quality dimensions are differentiated and surveyed separately. However, most of the instruments used show high correlations between the theoretically distinguished quality dimensions. This is the case even when the theoretically postulated structure is confirmed by confirmatory factor analysis. For example, Krammer et al. (2019) reported intercorrelations ranging from *r* = .81 to .95 between the three dimensions "instructional quality", "teacher–student relationship", and "performance monitoring" at the student and class levels. Analyses of the "students' perceptions of instructional quality" (SPIQ) instrument by Wisniewski et al. (2020) showed correlations from *r* = .63 to .93 between the seven dimensions of the instrument. For primary schools, van der Scheer et al. (2019) reported correlations between *r* = .42 and *r* = .74 using an IRT model. One exception seems to be the survey instrument of Fauth et al. (2014), which shows correlations between the dimensions of only *r* = .47, .50, and .70 at the student level and *r* = .23, .31, and .67 at the class level. However, a closer look reveals fundamental differences between the item formulations of the different quality dimensions: while the items of two of the dimensions start with "In our science class…", the third one uses "Our science teacher…".

Unfortunately, quite a number of the validation studies of student questionnaires on teaching quality did not report the correlations between the included scales (e.g. Bell & Aldridge, 2014; Tripod Education Partners, 2014), or they only tested the unidimensionality of single postulated scales (e.g. van Petegem et al., 2008).

At the same time, there are studies in which the theoretically postulated dimensions could be confirmed factor-analytically, but loaded highly, with large standardized loadings, on a latent second-order factor (e.g. Nelson et al., 2014, report λ = .70–1.02).

A survey instrument which has been intensively analyzed in recent years is the "Tripod" questionnaire (Tripod Education Partners, 2014). Based on explorative factor analyses at the class level, the developers of the instrument postulate seven dimensions. However, in-depth analyses, which simultaneously take into account the nested multi-level structure with student and class level, consistently point to the unidimensionality of this questionnaire (Kuhfeld, 2017; Schweig, 2014; Wallace et al., 2016). A possible further dimension suggested by analyses is only weakly separated and is characterized by items with a certain type of item formulation (Kuhfeld, 2017). When examining other questionnaires, studies found unidimensionality for those with 16 items (Bijlsma et al., 2019) and 64 items (Maulana et al., 2015).

Overall, the question arises of how to interpret the high statistical interrelations between theoretically well-distinguished dimensions of instructional quality in student surveys. A possible explanation, which we examine in this chapter, is the impact of an affective overall attitude of students toward the teaching behavior of the evaluated teacher, resulting in biasing effects during the response process for individual items.

In research, different terms are used to describe the phenomenon whereby an overall attitude or impression influences and interferes with the assessment of individual teaching characteristics. For example, Clausen (2002) speaks of the effect of an "affective overall impression", while other authors use the terms "halo effect" (e.g. Haladyna & Hess, 1994; Wagner, 2008) or "general impression halo" (Lance et al., 1994).

# *1.2 Possible Explanations for Halo Effects in Student Ratings*

One promising path to better insight into the phenomenon of high intercorrelations is to analyze how respondents process items. Tourangeau et al. (2000) divide the survey response process into four main cognitive components or steps. In the first step, *comprehension*, the respondent needs to understand the item and identify its focus. In the subsequent *retrieval* step, the respondent has to generate a retrieval strategy and cues, retrieve specific and generic memories, and fill in missing details. Next comes a *judgment* component, in which the completeness and relevance of the retrieved memories are weighed, ending in an estimate regarding the subject of the item. In the last step, the person gives a *response* in the requested way, e.g. marking the box with the best-fitting answering option.

If an overall affective attitude of satisfaction is present throughout the survey answering process, it influences the retrieval and judgment of the information related to the items. The rating on a particular aspect is therefore a combination of the person's overall satisfaction and the actual judgment of the particular aspect (Borg, 2003).

Applied to the situation of students, this would mean that ratings on particular aspects of teaching quality consist of a non-differentiating overall satisfaction with the teacher or class, and a rating component which concerns the particular aspect.
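As an illustrative formalization (our sketch, not Borg's (2003) exact model), the rating *r*<sub>sd</sub> that student *s* gives on quality aspect *d* can be written as a blend of a general satisfaction component and an aspect-specific judgment:

```latex
% g_s: overall satisfaction of student s; \lambda_d: halo weight of aspect d;
% t_{sd}: aspect-specific judgment; \varepsilon_{sd}: error
r_{sd} = \lambda_d\, g_s + t_{sd} + \varepsilon_{sd}
```

If the halo weights λ<sub>d</sub> are large for all aspects, the shared component *g*<sub>s</sub> dominates and the aspect scores intercorrelate highly, which is exactly the pattern reported above.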

According to the findings from research on social judgments, overall judgments on other persons are based on two fundamental dimensions of perception (Abele et al., 2008; Bakan, 1966). The first dimension, often called "agency", describes perception in terms of dominance, competence, or individualism. The second dimension, "communion", refers to perception concerning warmth, cooperation, social and community orientation. In the overall judgment of other people, the perceived communion plays a dominant role and is responsible for much larger parts of variance in character judgments (Abele & Bruckmüller, 2011).

The discussion about the overall impression—which dominates the students' judgments about teaching and teachers—points in a similar direction. A number of factors were discussed and examined which could well be subsumed under "communion". Wallace et al. (2016, p. 1859), for example, interpreted the overall factor as a judgment of such forms of teacher interaction which "makes them feel safe, respected, and competent". Kuhfeld (2017) explained the overall factor as an effect of students' perception of teachers' emotional support. Furthermore, findings also indicate that a higher teacher–student communion leads to higher desired learning behaviors of the students (Wubbels et al., 2015), and there is evidence of positive effects on learning achievement for learner-centered teaching approaches (Cornelius-White, 2007).

On the other hand, there are indications that an affective attitude toward the subject being taught could also cause biased ratings, and thus the detection of an overall factor. In line with this assumption, findings from research on student ratings of teaching quality point to an influence of students' general interest in the school subject on the perception of teaching (Ditton, 2002; Eder & Bergmann, 2004; Mayr, 2006; Rahn et al., 2019). Students' general interest in the school subject is known to show a relatively stable pattern from secondary school onwards (Schurtz & Artelt, 2014), although current teaching characteristics may cause minor changes (Ferdinand, 2014; Lazarides et al., 2015). Findings from Rahn et al. (2019) indicate that biasing effects of students' general interest in the school subject vary considerably between subjects, particularly with regard to the distinction between compulsory and optional courses. But as Ferdinand (2014) showed, these distortions seem to be largely neutralized when student ratings are aggregated at the class level. Research findings from higher education also support biasing effects of the perception of the teacher as well as of the subject. For example, in the study by Greimel-Fuhrmann (2014), interest in the subject and the teachers' level of student orientation proved to be predictive of the students' overall rating of teaching quality.

In summary, both the student-perceived communion of a teacher and the general interest in the subject could create an affective overall impression (Clausen, 2002), which as a "general impression halo" (Lance et al., 1994) overlays the student ratings of the individual quality dimensions. This could explain the low statistical separation between the dimensions of teaching quality. Therefore, in the following, we analyze the explainability of a halo bias in student ratings on teaching quality by these two factors in the context of secondary schools.

# **2 Empirical Part: Explaining Halo Effects in Student Ratings of Teaching Quality Through Students' Perception of Teachers' Communion and Interest in the Subject Being Taught**

This study focuses on the following research questions:

RQ1: To what extent can an overlaying second-order factor in the sense of a general impression halo be modeled superordinately to the various dimensions of teaching quality?

RQ2: Can this second-order factor in student ratings on teaching quality be explained by a) teachers' communion perceived by the students and/or b) students' overall subject-specific interest?

RQ3: To what extent can the strength of the correlational structure between the different dimensions of teaching quality be reduced by controlling for one or both of these factors?

These research questions are addressed at the individual as well as at the class level.

# *2.1 Methods and Sample*

## **2.1.1 Design and Sample**

Data used for the following analyses were collected from different secondary schools in the southwestern part of Germany, where teachers obtained student feedback on their teaching and classes. For research purposes, the student feedback questionnaires were supplemented by instruments measuring teachers' communion and students' general interest in the subject taught by the teacher. The sample comprises a total of *N* = 1056 students from 50 classes at lower track schools (*Werkrealschule*, 9.6%), middle track schools (*Realschule*, 35.5%), grammar and high schools (*Gymnasium*, 49.6%), and secondary comprehensive schools (*Gemeinschaftsschule*, 5.3%). The students belong to grades 5–6 (28.0%), 7–8 (20.3%), 9–10 (30.6%), and 11–13 (21.1%), and are aged between 10 and 19 years. Teachers' professional experience and gender were not surveyed for reasons of anonymity, but the sample included both young professionals and very experienced teachers, as well as female and male teachers. The teachers were free to choose the class and course in which they used the questionnaire. Therefore, the sample covers a wide range of taught subjects, including math, German, foreign languages, science, and history, but not physical education.


**Table 1** Measurement instruments

<sup>a</sup> For the reliability estimator McDonald's omega, see McDonald (1999)

#### **2.1.2 Measures**

#### *Feedback Questionnaire on Teaching Quality (FQTQ)*

The *Feedback Questionnaire on Teaching Quality* (*FQTQ*, Röhl, 2015) is based on the characteristics of good teaching according to Meyer (2005) and includes 24 items with a four-level Likert format ("fully agree" to "disagree"). The aim of the instrument is to provide teachers with indications for improving their own teaching and classes. In total, the FQTQ assesses five quality dimensions of teaching: "Clarity of content and explanations", "Activation and use of adaptive methods", "Classroom and teaching management", "Individual care and kindness", and "Transparency of assessment" (see Table 1). All scales showed satisfactory to good reliability values using the reliability estimator ω (McDonald, 1999), which has proven particularly reliable for short scales and in the context of structural equation modeling (Revelle & Zinbarg, 2009; Teo & Fan, 2013).<sup>1</sup> The item formulations are kept as low-inference as possible (Wagner, 2008) and—in order to avoid comprehension problems (Clausen, 2002)—are formulated positively throughout. In all quality dimensions, both ego- and we-references are used in the item wordings.

<sup>1</sup> McDonald's ω is based on the parameter estimates of the items in a factor model and represents the ratio of the variance due to the common factor to the total variance.

Confirmatory factor analyses indicated a good fit of the theoretically assumed structure with five factors (χ2(242) = 1597.8, *p* < .001, CFI = .989, TLI = .987, RMSEA = .014). This was also evident in comparative analyses using a model with one single overall factor, which resulted in less favorable fit statistics (χ2(252) = 3591.7, *p* < .001, CFI = .923, TLI = .916, RMSEA = .069).

To survey the students' perception of the "teacher communion", the scales "CD: helping/friendly" and "CS: understanding" of the "Questionnaire on Teacher Interaction" (Wubbels & Levy, 1991) were used, which reflect a high degree of this basic dimension. The response scale for the 12 items comprises five levels (from "1: never" to "5: always").

In addition, we measured students' overall subject-related interest using the two items "The subject itself interests me" and "I like the subject itself", using a five-point scale (from "very" to "not at all").

#### **2.1.3 Data Analysis**

The data were analyzed by means of single- and multi-level structural equation analyses using Mplus 8.4 (Muthén & Muthén, 2012–2019). Considering the ordinal level of the four- and five-point rating scales, the "categorical" option was used for the measurement models, which relies on polychoric correlations for the corresponding sub-models. At the same time, this procedure models response behavior in the sense of a probabilistic latent trait analysis (Uebersax, 2010–15). We chose the robust weighted least squares estimator (WLSMV) as the estimation method, which has shown high reliability for ordinally scaled measurement models in simulation studies (Flora & Curran, 2004). The clustered data structure was accounted for using the option "type = complex". The high number of parameters of the ordinal measurement models made it necessary to use the less computationally intensive Bayesian estimator for the subsequent multi-level analyses (Asparouhov & Muthén, 2012).

# *2.2 Findings*

#### **2.2.1 Modeling a Latent Second-Order Factor**

To examine research question 1 (whether a factor overlaying the dimensions of teaching quality can be modeled reflecting an overall impression) an SEM was specified in which the overall impression is represented as a latent second-order factor. Fit indices pointed to a good fit of the assumed structure (χ2(247) = 414.7, *p* < .001, CFI = .978, TLI = .975, RMSEA = .021, SRMR = .040). Analyses indicated medium to large loadings of the five teaching quality dimensions on the second-order factor (clarity: β = .928, methods: β = .984, classroom management: β = .739, care: β = .878, transparency: β = .772, *p* < .001 each).



In the next step, to answer research question 2, the effects of the perceived communion of the teacher and of students' interest in the subject on the overall impression factor were determined. For this purpose, three regression models were estimated at the student level. First, the two possible influencing factors were analyzed individually (models 1 and 2), and then combined (model 3). The results are summarized in Table 2.

Model 1 examined the perceived teacher communion as an explanatory variable for the second-order overall impression factor. It shows a good model fit and explains more than 70% of the variance of the overall impression. A slightly inferior fit is shown by model 2, which tests students' general interest in the subject as the source of the overlaying effect and explains 40% of the variance. Under the assumption of model 3 that both influencing factors jointly explain the overall impression, 76% of the variance can be explained (see Fig. 1). The far greater proportion is thus explained by the perceived teacher communion. The two factors correlate with each other at a medium level (*r* = .53).

#### **2.2.2 Correlations Between Teaching Quality Dimensions Controlling for Students' Perception of Teacher Communion**

In order to analyze the effects on the intercorrelations between the various quality dimensions of teaching, the extent to which these can be explained by perceived teacher communion and general interest in the subject was investigated, analogously to the approach of Borg (2003) described above (research question 3). A structural equation model was used to determine the direct effects of these factors on the items of teaching quality. This procedure extracted the variance component related to these factors, so that only the remaining variance components loaded onto the quality dimensions.

**Fig. 1** Structural equation model at student level explaining the overall factor by perceived teacher communion and students' general interest in subject (model 3). All loadings are standardized and significant at the *p* < .001-level

At the student level the model showed a good fit (χ2(291) = 477.3, RMSEA = .025, CFI = .979, TLI = .973, SRMR = .034). The loadings of the items on the quality dimensions remained significant (*p* < .05), with two exceptions, but decreased substantially (average item loadings: clarity: .29, methods: .24, classroom management: .44, care: .21, transparency: .44). Whereas teacher communion showed highly significant effects on each of the 24 individual items, ranging from β = .21–.84 (*p* < .001), the analysis of general interest in the subject revealed only eight significant, and much weaker, effects (β = .10–.42, *p* < .05).

At the same time, the intercorrelations between the individual quality dimensions decreased substantially, and in some cases were no longer significant (Table 3). This is especially true for the dimensions "Individual Care and Kindness" and "Transparency of Assessment", which showed no or only low correlations with the other dimensions. The partially negative correlations of the care dimension can be understood as a suppression effect, since this dimension has the highest content overlap with communion.


**Table 3** Intercorrelations between the perceived dimensions of teaching quality at the student level. Below the diagonal = without control of the communion; above the diagonal = with control of the item-related effects of the communion

|                               | 1          | 2          | 3          | 4          | 5 |
|-------------------------------|------------|------------|------------|------------|---|
| 5. Transparency of assessment | .728\*\*\* | .748\*\*\* | .486\*\*\* | .703\*\*\* | X |

\**p* < .05, \*\**p* < .01, \*\*\**p* < .001
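The logic of this control step can be illustrated with a small simulation (invented numbers, not the chapter's data): two quality dimensions are generated so that most of their shared variance stems from a communion-like general impression; residualizing the scores on communion then collapses their intercorrelation, mirroring the pattern in Table 3.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000  # students

communion    = rng.normal(size=n)   # halo / general impression component
clarity_true = rng.normal(size=n)   # dimension-specific components
care_true    = rng.normal(size=n)

# Observed dimension scores: dominated by the communion halo.
clarity = 0.85 * communion + 0.4 * clarity_true + rng.normal(0, 0.3, n)
care    = 0.85 * communion + 0.4 * care_true    + rng.normal(0, 0.3, n)

def residualize(y, x):
    """Remove the linear effect of x from y (simple regression residuals)."""
    beta = np.cov(y, x, ddof=0)[0, 1] / x.var()
    return y - beta * x

r_raw = np.corrcoef(clarity, care)[0, 1]
r_ctl = np.corrcoef(residualize(clarity, communion),
                    residualize(care, communion))[0, 1]
print(f"raw intercorrelation:      {r_raw:.2f}")  # high, halo-driven
print(f"controlling for communion: {r_ctl:.2f}")  # near zero
```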

#### **2.2.3 Analyses at the Class Level**

With regard to the class level, model 3 was extended to a two-level model. The results for the effects at the student level remained almost constant compared to the previous findings. At the class level, the loadings of the individual teaching dimensions on the second-order overall factor showed similarly high values as at the student level (clarity: β = .86, methods: β = .95, classroom management: β = .60, care: β = .98, transparency: β = .67, *p* < .001 each). Interestingly, the effect of teacher communion on the overall factor was considerably higher (β = .87, *p* < .001), whereas interest in the subject no longer showed any significant effect (β = .12, *p* = .198). Replicating the analysis of the item-related effects of communion and general subject interest at the class level led to the almost complete elimination of significant item loadings on the dimensions of teaching quality.

# **3 Discussion**

The findings presented here indicate that an overall impression which overlays the perception of teaching quality can be modeled as a latent second-order factor. This modeled overall impression can be explained to a large extent by the teacher communion perceived by the students. Students' general interest in the subject taught shows significant effects only at the individual level, and these effects are small. Thus, at the class level the general subject interest does not appear to have any relevant effect on the overall impression, and does not induce a bias in the assessment of teaching quality when the data are aggregated for classes. These results are in line with the findings of Ferdinand (2014). The findings also point to the existence of a "general impression halo" in accordance with Lance et al. (1994), which is based on an affective attitude—to a larger extent toward the teacher and to a lesser extent toward the subject being taught. Furthermore, the significant medium-sized correlation between communion and interest in the subject suggests that there could be a reciprocal influence between students' perceptions of the subject and of the teacher.

Thus, the affective overall impression reported in the literature seems to be predominantly based on students' perception of teachers' communion, which means that the teacher is perceived as being interested in the learning progress of all students and sympathetic to the needs of the learning group. These results show that the theory of social judgments (Abele & Bruckmüller, 2011) provides a valid framework for obtaining a better understanding of the processes of students' assessment of teaching and classes.

The control of direct item-related effects of teachers' communion shows that the high intercorrelations of the dimensions of teaching quality in which the halo effect manifests itself can be drastically reduced—in some cases even to an insignificant level. For the general interest in the subject taught this is only true to a much smaller extent.

However, it can be theoretically argued that high teaching quality may indeed go hand in hand with students' perception of high teacher communion. In this case, students' perception of a high communion of the teacher could be based on an inner attitude of respect and empathy on the teacher's part, which in turn contributes to an overall higher quality in the different teaching dimensions; conversely, a less empathic attitude from the teacher could lead to lower teaching quality (Tausch, 2007). Thus, this inner attitude could lead to teaching that is better adapted to the students' learning (from a methodological-didactical point of view), and also to a performance assessment that is more comprehensible to the students. Conversely, didactics and methodology that are more strongly oriented toward the students could lead to a higher assessment of teacher communion. In this case, the affective overall impression created by the perceived teacher communion would not represent a problematic bias in the context of student feedback or the measurement of teaching quality. Rather, as a concomitant of higher teaching quality, it would be a central element of its valid measurement.

On the other hand, the perception of a high communion of a teacher could also lead to teaching with qualitative deficits (with regard to pedagogical action in class) being assessed more positively by the students than might be appropriate. In this case, weaker quality among teachers with high communion would be masked by this perception. In other words, in such cases there could be a severe bias influencing the measurement of teaching quality. This could explain why many studies showed no or only minor predictive effects of teaching quality as measured by student surveys on learning achievement, and why large differences between the quality perceptions of students and external observers are often reported (see, for example, Chap. 5 by van der Lans in this volume; Fauth et al., 2014; Kuhfeld, 2017). If this is the case, a way out could be to control for the perceived teacher communion through partial regressions, as was done in the analyses presented here.

However, both phenomena could also exist, which means that on the one hand there are good teachers with high communion and worse teachers with lower communion, for whom the effect described here is not a bias; on the other hand, there are also situations in which good teachers with lower communion are rated worse by the students and worse teachers with high communion are rated much better. In this case, there is a need to clarify whether indicators can be developed to distinguish between these two situations. These could then be used as a supplement to the classical evaluation procedure for feedback to teachers.

Further research is needed to address the issues raised in this chapter and, if necessary, to develop methods for correcting the measurement of teaching quality through student surveys. This would require longitudinal studies of perceived teaching quality over a period of joint work by teachers and classes. In such studies, it would be especially valuable if the dimensions of teaching quality and teacher communion were also assessed by external observers. At the same time, experimental study designs could be fruitful, in which students rate the same teacher's statements delivered with varying communion, e.g. as video vignettes with actors. In addition, studies controlling for the use of ego- and we-references in the item wordings could help provide deeper insight into this effect (den Brok et al., 2006).

When using student ratings as classroom feedback, teachers should be aware that there is an overlaying halo effect related to their communion. Teachers perceived more positively by learners in this respect should therefore be more critical of the feedback received. Conversely, relatively unfavorable ratings, which can be associated with a lower perceived communion, are an indication to reconsider and improve related aspects of teaching quality. For a reliable assessment and control of such effects, it would be advantageous to supplement the questionnaires on teaching quality used in practice and research with a scale measuring the teacher communion perceived by the students. If such information is not available, teachers should bear in mind that data evaluated at the individual level might be less confounded with their communion than data aggregated at the classroom level. It might therefore be advisable to evaluate the data at both levels to gain better insight into how one's teaching practice is perceived by the students.

# **References**



McDonald, R. P. (1999). *Test theory: A unified treatment*. L. Erlbaum Associates.


**Sebastian Röhl** has been an Academic Assistant in the Department of Educational Science at the University of Education Freiburg and is currently a Postdoctoral Researcher in the Institute of Education at Tübingen University (Germany). Before that he worked for more than 10 years as a grammar school teacher, school development consultant, and in teacher training. Among other areas, he conducts research in the fields of teaching development and teacher professionalization through feedback, social networks in inclusive school classes, as well as teachers' religiosity and its impact on professionalism. In addition, he is the Director of an in-service professional master's study program for teaching and school development.

**Wolfram Rollett** is a Professor of Empirical Educational Research at the University of Education Freiburg and the Freiburg Advanced Center of Education (FACE). Previously he worked as a researcher and lecturer in the field of Educational Science and Psychology at the Universities of Potsdam, Braunschweig, Dortmund, and Wuppertal. His research focuses on school development processes, the quality of extra- and co-curricular activities, educational effectiveness, and classroom composition.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 4 The Quality of Student Perception Questionnaires: A Systematic Review**

#### **Hannah Bijlsma**

**Abstract** Student perceptions of teaching are promising for measuring the quality of teaching in primary and secondary education. However, generating valid and reliable measurements when using a student perception questionnaire (SPQ) is not self-evident. Many authors have pointed to issues that need to be taken into account when developing, selecting, and using an SPQ in order to generate valid and reliable scores. In this study, 22 SPQs that met the inclusion criteria used in the literature search were systematically evaluated by two reviewers. The reviewers were most positive about the theoretical basis of the SPQs and about the quality of the SPQ materials. According to their evaluation, most SPQs also had acceptable reliability and construct validity. However, norm information about the quality rating measures was often lacking and few sampling specifications were provided. Information about the features of the SPQs, if available, was also often not presented in an accessible way by the instrument developers (e.g., in a user manual), making it difficult for potential SPQ users to obtain an overview of the qualities of available SPQs in order to decide which SPQs best fit their own context and intended use. It is suggested to create an international database of SPQs and to develop a standardized evaluation framework to evaluate the SPQ qualities in order to provide potential users with the information they need to make a well-informed choice of an SPQ.

**Keywords** Student perception questionnaires · Teaching quality · Systematic review

# **1 Introduction**

Student perceptions of teaching are promising as a way to measure the quality of teaching in primary and secondary education (Ferguson, 2012; Ferguson & Danielson, 2014; Schulz et al., 2014; van der Scheer et al., 2018). The scores provided by students with regard to their teachers' teaching can be used for research

H. Bijlsma (B)

Section of Teacher Professionalization, University of Twente, Enschede, the Netherlands e-mail: h.j.e.bijlsma@utwente.nl


and accountability purposes, for example, or for teacher professional development (Timperley et al., 2007). A student perception questionnaire (SPQ) is often used for collecting student perceptions of teaching quality, and usually consists of a set of items about the quality of teaching that students have to respond to using a numeric scale. An SPQ also involves the information and activities required to use the instrument as intended. Therefore, ideally, an SPQ also includes a user manual, scoring rules, and sampling specifications (for example, specifications regarding the subject of the lesson). Although the use of SPQs is not new, several studies (Bijlsma et al., under review; Lüdtke et al., 2006; van der Scheer et al., 2018; Wallace et al., 2016) have recently generated renewed interest in reliability and validity issues surrounding SPQs and other teacher evaluation approaches, such as classroom observation systems (Bell et al., 2018; Dobbelaer, 2019).

Generating valid and reliable scores using an SPQ is not guaranteed. Many authors (e.g., Bijlsma et al., under review; van der Lans & Maulana, 2018; Chap. 7 by Göllner et al. in this volume) have pointed to issues that need to be taken into account when developing and/or using an SPQ in order to generate valid and reliable scores. These include issues regarding the theoretical basis of the items and the constructs that the SPQ aims to measure. However, these issues are often not (fully) addressed by SPQ users or developers, calling the reliability and validity of the student scores into question. Moreover, there is no overview of the student perception questionnaires available for use in primary and secondary education that identifies their psychometric characteristics and the teaching quality constructs they aim to measure. Therefore, in this study, a systematic review of SPQs in primary and secondary education was conducted. The SPQs found were reviewed based on an evaluation framework developed for this study. An overview with general information about the SPQs and the results of the evaluation is presented, together with an overview of the constructs measured by the SPQs. These overviews contribute to an increased awareness among developers and users of the complexity of SPQs, and to the deliberate design and use of SPQs. Note that in this chapter, no clear or standard definition of teaching quality is presented, because there is minimal consensus about its conceptualization or definition across SPQs (Cohen & Goldhaber, 2016).

# **2 The Evaluation Framework**

A selection of SPQs was reviewed using an evaluation framework (available in Dutch) consisting of seven standards: the theoretical basis of the questionnaire, the quality of the questionnaire, the quality of the manual, norms, reliability, construct validity, and criterion validity. The standards in the framework were drawn from two strands of literature: the literature on SPQs and the literature on testing and performance assessment (the COTAN evaluation standards for test quality; Evers et al., 2010<sup>1</sup>). The evaluation framework and the underlying theory are outlined in the following paragraphs. Additionally, SPQ assessment purposes are distinguished and discussed.

# *2.1 Evaluation Standard 1—The Theoretical Basis of the Questionnaire*

Each SPQ in this review includes a scoring format or tool consisting of items that are scored on a rating scale. In most SPQs, several items together form a construct<sup>2</sup> (Marsh, 2007), an aspect of teaching quality as perceived by students (Maulana & Helms-Lorenz, 2016). All SPQs measure the quality of teaching; however, they can focus on different constructs as perceived by students. Based on several meta-analyses related to effective teaching (Praetorius et al., 2018; Bell et al., 2018; Creemers, 1994; Pianta & Hamre, 2006; Sammons et al., 1995), nine constructs that are known to be effective for student learning are distinguished in this study: a safe and stimulating classroom climate, classroom management, the involvement and motivation of students, explanation of subject matter, the quality of subject-matter representation, cognitive activation, assessment for learning, differentiated instruction, and teaching learning strategies and student self-regulation. The constructs in SPQs can be derived from different theories, research, or standards, but all should have a solid scientific basis (American Educational Research Association [AERA] et al., 1999) and the items should cover the theoretical constructs (Evers et al., 2010). Although the theoretical basis of the questionnaires was evaluated in this review, it was not feasible to also evaluate the quality of the research underlying each questionnaire.

# *2.2 Evaluation Standard 2—Quality of the Questionnaires*

To evaluate the quality of the questionnaires (corresponding to the "materials" criterion for general psychological tests), the item design of the SPQ is considered and it is determined whether the scoring system and procedure are standardized. The items of an SPQ can be subject-specific (e.g., designed to capture the quality of mathematics teaching) or generic (usable across subjects), and can focus on teachers' actions, students' actions, or both (Bell et al., 2018). The number of items included in the scoring tools can differ, as can the response categories in the rating scale. Strong (2011) pointed out that a large number of items can be problematic for students because "there is an upper limit of a rater's ability to match his or her responses to a given set of stimuli" (the channel capacity; Strong, 2011, p. 88). Although utilizing a small number of items may reduce students' cognitive load and be adequate for evaluating teaching quality, more items make it possible to provide teachers with richer feedback on their strengths and weaknesses, which is needed for improvement (Marzano, 2012).

<sup>1</sup> The COTAN evaluation standards are used by the Dutch Committee on Tests and Testing (COTAN) to evaluate the quality of psychological tests available in the Netherlands. COTAN has audited over 750 tests published for professional use.

<sup>2</sup> Others have used terms such as dimension, scale, or pattern to refer to what I am calling "construct," but in my opinion such terms do not capture well the conceptual link with the aspects of teaching quality perceived by the student.

# *2.3 Evaluation Standard 3—The Quality of the Manual*

A description of the scoring tools should enable potential users to judge whether an SPQ is suitable for their purposes and should therefore include a description of the constructs the SPQ aims to measure, the type of use for which the SPQ has been developed and who/what can be observed by using the SPQ (Evers et al., 2010).

# *2.4 Evaluation Standard 4—Norms*

Numeric ratings usually result in a raw score, which is partly determined by the characteristics of the SPQ. The norms evaluation standard assesses whether the SPQ provides a meaningful interpretation of its results (Evers et al., 2010). Two ways of "scaling" or categorizing can be used to interpret the raw scores (AERA, APA, & NCME, 1999). First, a set of scaled norms may be derived from the distribution of the raw scores of a reference group; this is called norm-referenced interpretation (Drenth & Sijtsma, 2006; Bechger et al., 2009). Second, standards may be derived from a domain or subject matter to be mastered, or from the results of empirical validity research: the domain-referenced interpretation and the criterion-referenced interpretation, respectively (Berk, 1986; Vos & Knuver, 2000). Providing norms or standards gives meaning to the raw scores and makes the SPQ more user-friendly.
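To make norm-referenced interpretation concrete, here is a minimal sketch (in Python, with invented reference data; none of this comes from the reviewed SPQs) of how a raw scale score can be converted into a percentile relative to a reference group:

```python
import numpy as np
from scipy import stats

# Hypothetical reference group: mean scale scores (1-5 rating scale)
# for 200 teachers, used purely for illustration.
rng = np.random.default_rng(0)
reference_scores = rng.normal(loc=3.6, scale=0.4, size=200).clip(1, 5)

raw_score = 3.9  # hypothetical raw score of the teacher being evaluated

# Norm-referenced interpretation: the teacher's position relative to the group.
percentile = stats.percentileofscore(reference_scores, raw_score)
print(f"A raw score of {raw_score} lies at the {percentile:.0f}th percentile "
      "of the reference group.")
```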

# *2.5 Evaluation Standard 5—Reliability*

Providing reliability evidence is primarily the responsibility of SPQ developers, since prospective users need this information to make an informed choice among alternative instruments or other measurement approaches, and since prospective users will generally be unable to conduct reliability studies prior to the operational use of an SPQ (AERA et al., 1999). In this evaluation, the assessment purpose was taken into account: a high reliability coefficient (or a similar measure) is more critical for high-stakes decisions (e.g., tenure decisions) than for low-stakes decisions such as teacher professional development activities. The quality of the research was also taken into account (as suggested by Evers et al., 2010), namely, whether the (analysis) procedures followed were correct, whether the research had been conducted in the target group of the SPQ (e.g., an SPQ designed for primary education should be investigated in primary education), and whether developers provided enough information to thoroughly judge the reliability of the SPQ scores. Various types of reliability measurement can be used, such as parallel-form and split-half reliability, reliability based on inter-item covariance, test-retest reliability, interrater reliability, generalizability theory, and structural equation models. Other methods for reliability testing are Guttman's lambda-2 (Guttman, 1945) and the greatest lower bound (glb; ten Berge & Socan, 2004). For a more detailed description of generalizability theory (as well as classical test theory and item response theory), see Chap. 2 by Bijlsma et al., in this volume.
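As a minimal sketch of two of these measures, the following Python snippet computes Cronbach's alpha and Guttman's lambda-2 from a (respondents × items) score matrix; the data are simulated, and the implementation is illustrative rather than the procedure of any particular SPQ developer:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha from a (respondents x items) score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

def guttman_lambda2(scores: np.ndarray) -> float:
    """Guttman's lambda-2 (Guttman, 1945), a lower bound to reliability."""
    cov = np.cov(scores, rowvar=False)
    k = cov.shape[0]
    off_diag = cov - np.diag(np.diag(cov))   # off-diagonal covariances only
    total_variance = cov.sum()               # variance of the sum score
    return (off_diag.sum()
            + np.sqrt(k / (k - 1) * (off_diag ** 2).sum())) / total_variance

# Simulate 300 students answering 8 items that tap one underlying construct.
rng = np.random.default_rng(1)
construct = rng.normal(size=(300, 1))
items = construct + rng.normal(scale=0.8, size=(300, 8))
print(f"alpha = {cronbach_alpha(items):.2f}, "
      f"lambda-2 = {guttman_lambda2(items):.2f}")
```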

# *2.6 Evaluation Standard 6—Construct Validity*

Validity reflects the extent to which an SPQ fulfills its purpose (the measurement of a specific construct; Drenth & Sijtsma, 2006); it shows whether the instrument is useful or not. Validity is "a matter of degree rather than an all-or-none property and validation is an unending process" (Nunnally, 1967, p. 75). Several methods can be used to support construct validity, such as research on the (uni)dimensionality of the items (exploratory or confirmatory factor analyses), on the quality of the items and the fit of the items to a model (e.g., an IRT model; see Chap. 2 by Bijlsma et al., in this volume), and on the correlations between relevant scales.
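As a small illustration of one such method, the snippet below checks the (uni)dimensionality of a simulated item set by inspecting the eigenvalues of the inter-item correlation matrix, a common first step before a full factor analysis; it is a sketch on simulated data, not a substitute for a proper EFA or CFA:

```python
import numpy as np

rng = np.random.default_rng(2)
n_students, n_items = 400, 10

# Simulate responses driven by a single underlying construct plus noise.
construct = rng.normal(size=(n_students, 1))
responses = construct + rng.normal(size=(n_students, n_items))

corr = np.corrcoef(responses, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# One dominant eigenvalue suggests the items tap a single construct.
print("eigenvalues:", np.round(eigenvalues, 2))
print("first/second ratio:", round(eigenvalues[0] / eigenvalues[1], 1))
```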

# *2.7 Evaluation Standard 7—Criterion Validity*

Criterion validity (also called predictive validity) concerns the relationship between SPQ scores and relevant external variables (e.g., whether SPQ scores for teachers' teaching quality relate to student achievement). Some SPQs are developed specifically for the purpose of investigating criterion validity. In addition, aspects of validity could also be captured through interviews with students, relating to the content and the factual accuracy of their understanding of the items. However, this type of validity research is not always conducted.

# *2.8 Possible SPQ Assessment Purposes*

The evaluation standards described above are based on the circumstances of psychological testing procedures and not, for example, on the use of an SPQ as a pure feedback instrument that reports student perceptions of teaching. The necessity of meeting each standard is related to the assessment purpose of the SPQ. For example, norms are not necessarily needed when student perception scores are used as feedback to teachers, nor are construct and criterion validity (Kane, 2006). In the context of this chapter, I distinguish four SPQ assessment purposes (Mislevy, 2013; Bell, 2019): assessment for research practice (e.g., for measuring intervention effects), assessment as a feedback loop (e.g., for improvement at the teacher level [teaching quality] or school level), assessment as an evidentiary argument (e.g., to develop claims that are supported by measurements for personnel decisions), and assessment as a measurement (e.g., to specify and test assessment models as a way to work toward models that enable representation of real-world resources). In the results section, I indicate which SPQs seem suitable for which assessment purpose.

# **3 Method**

Before searching for SPQs, a review protocol was developed in collaboration with an information specialist. This protocol included the aim of the review, the research questions, the inclusion and exclusion criteria, and the search strategy.

# *3.1 Inclusion and Exclusion Criteria for Questionnaires*

The review protocol specifies the characteristics of the populations, interventions, contexts, and outcomes of interest:

*Population*: The SPQs were designed for use in primary and secondary education.

*Intervention*: The SPQs measure teaching quality. Aspects of learning climate and classroom climate were included, as well as the extent to which students are involved in their lessons.

*Context*: The SPQs were designed for teacher-oriented lessons, such as mathematics, language, or reading.

*Outcomes of interest*: SPQs were only included if research had been done on the psychometric quality of the instrument. No exclusions were made based on the methodology of the study.

Other criteria for inclusion were forms of publication, language, and time period (Littell et al., 2008).

*Forms of publication*: To deal with publication bias, peer-reviewed publications, dissertations, and unpublished articles were all included.

*Language*: No criteria were set based on the country where the instrument was developed. However, for practical reasons, only questionnaires published in Dutch and English were included.

*Time period*: Only questionnaires developed between 1970 and 2016 were included. SPQs developed earlier than 1970 would be very outdated and inappropriate for the current educational context, while 2016 was the year in which the research project took place.

# *3.2 Search Strategy*

Littell et al. (2008) described a procedure for efficiently planning a systematic literature review. The steps of this procedure are: searching bibliographic and scientific databases using search terms and strings; searching sources of unpublished articles and dissertations; and consulting personal contacts. Based on the experiences of fellow researchers, the step of "hand searching" was not conducted, because it had turned out that nothing relevant was found that way (Dobbelaer et al., 2015).

Following this procedure, first, six databases were searched (ERIC, PsycInfo, Web of Science, ProQuest, Scopus, and Narcis<sup>3</sup>), with the last search conducted in April 2016. The search terms included: evaluation, perception, student, teacher, education, school, and psychometric. Synonyms for each individual concept were combined with the Boolean operator *OR*, and the different concept lists were combined using the Boolean operator *AND*. For Narcis, the same terms translated into Dutch were used for the search. The search terms are presented in the Appendix. All searches were run in all the databases on titles, abstracts, and keywords, but not on full text. Backward snowballing (reviewing the reference lists of identified articles) resulted in more relevant SPQ publications and grey literature (unpublished articles and dissertations). The field experts contacted were all researchers conducting research into teaching evaluation, or had developed or used an SPQ in their research. A total of 92 researchers from 13 different countries were contacted for the purpose of both a systematic review of classroom observation systems (Dobbelaer, 2019) and the current study. They were asked to name SPQs that met the inclusion criteria.

For the systematic review, the PRISMA reporting guidelines were followed (Liberati et al., 2009; Moher et al., 2009). A total of 1,544 publications were identified. After screening titles and abstracts to check whether the articles were specifically about newly developed SPQs (and not, for example, about research done with an already available SPQ), 290 articles remained. Duplicates (56) were then removed, and an additional 160 articles were removed after scanning the texts and reading the titles and abstracts more closely. The remaining 74 publications were selected for full-text review, as a result of which 49 publications were omitted based on the inclusion/exclusion criteria, leaving a total of 25 questionnaires. Other publications in which research had been done *with* the instruments were read as well, to address practicality and usability issues or—if available—interrater reliability. The whole search was documented using RefWorks and Excel.

<sup>3</sup> Dutch database for scientific research.

During the evaluation phase of this study, three questionnaires turned out to be impossible to evaluate. The Science-Technology Learning Environment Questionnaire (STLEQ) could not be evaluated because it did not meet the inclusion criterion of context. Comenius was the second questionnaire that could not be evaluated, due to language difficulties that were not noticed during the search: the questionnaire was developed in Serbian and English, but the user guide and all other available sources were only available in Serbian. After contact with the developer, the decision was made not to evaluate the questionnaire. Lastly, the Learning Environment Inventory (LEI) was too out of date. The research done with the questionnaire was carried out in 1982, but it was originally developed in 1967. Both evaluators concluded that the questionnaire was too outdated, so it was excluded from the list. The elimination of these three questionnaires left a total of 22 questionnaires to be evaluated. Figure 1 shows the full flowchart following the PRISMA guidelines.

# *3.3 Description of the SPQ*

A range of sources provided information about the SPQs: user manuals, manuals for SPQ use in research projects, peer-reviewed publications, dissertations, grey literature, websites, and personal contact with the authors. All relevant information was described in an overview for each SPQ, which included information on the general characteristics of the SPQ and the results of the evaluation. Additionally, the SPQ assessment purposes were analyzed and presented together with the descriptives for the SPQs.

# *3.4 Evaluation Procedure*

The SPQs were reviewed based on the evaluation framework described above. After instruction in the use of the evaluation framework, two reviewers independently evaluated two SPQs. Based on the results, the two reviewers clarified all uncertainties and, where necessary, refined the evaluation framework. Next, two strongly differing questionnaires (in terms of development date, content, and intended use) were evaluated by both reviewers. The interrater reliability at that point was 0.87 (Cohen's kappa). Because this was considered sufficiently high, the two reviewers independently evaluated all the remaining SPQs. After evaluating all SPQs, the scores were discussed by the two reviewers to come to an agreement about the final judgment.
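For readers unfamiliar with the statistic, the snippet below shows how such an interrater agreement coefficient can be computed from two reviewers' dichotomous judgments; the ratings are invented for illustration and are not the data from this study:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical "met" (1) / "not met" (0) judgments by two reviewers
# on the same set of framework questions.
reviewer_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
reviewer_b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1]

# Cohen's kappa corrects raw agreement for agreement expected by chance.
print(f"kappa = {cohen_kappa_score(reviewer_a, reviewer_b):.2f}")
```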

**Fig. 1** Flowchart of the PRISMA guidelines followed in this study

# *3.5 Evaluation Framework*

The evaluation framework consists of seven evaluation standards (as described earlier in this chapter) with 26 questions about the quality of the SPQ (ranging from 3 to 5 questions per standard). The reviewers scored all questions on a dichotomous scale (met or not met). The reviewers were instructed to assign the score "not met" if there was not enough evidence to evaluate a standard. The reviewers were required to give a reason for every response.

# *3.6 Analysis*

The results of this review are descriptive, consisting of a description of the SPQs (based on the information gathered about them and on the evaluation framework), descriptive statistics summarizing the results of the review per evaluation standard, and, where relevant, a description of the reasons given for a score. These reasons were analyzed qualitatively using open and axial coding.

# **4 Results**

# *4.1 General Information*

An overview of the 22 questionnaires can be found in Table 1. The number of items differed, ranging from 16 to 96. Eleven questionnaires used a 5-point rating scale, seven a 4-point rating scale, one a 3-point rating scale, two a 2-point rating scale, and one used both a 3- and a 5-point rating scale. Fifteen questionnaires were developed for secondary education, four for primary education, and three could be used for both (according to the authors of the questionnaires). Questionnaires also differed in their date of development: eleven were developed between 2005 and 2014, five between 1995 and 2004, three between 1985 and 1994, and the oldest three between 1970 and 1984. SPQ assessment purposes differed across the questionnaires. All SPQs were intended for research practice, while some were additionally used as a feedback tool (5) or for measurement purposes (4). Only two SPQs were developed to make an evidentiary argument. Remarkably, criterion validity could be evaluated in the SPQs that were intended for measurement assessment (e.g., intended to specify and test assessment models as a way to work toward models that enable representation of real-world resources). All SPQs measured at least one of the teaching quality constructs described earlier (a safe and stimulating classroom climate, classroom management, the involvement and motivation of students, explanation of subject matter, the quality of subject-matter representation, cognitive activation, assessment for learning, differentiated instruction, and teaching learning strategies and student self-regulation). See Table 2 for an overview of the teaching quality constructs measured by the SPQs.

# *4.2 Evaluation Results*

The evaluation results can be found in Table 1. Regarding the theoretical basis of the questionnaire as specified in the evaluation framework (standard one), all SPQs met the evaluation standard. This means that the constructs could be derived from different theories, research, or standards, but all had a solid scientific basis and the items covered the theoretical constructs. Although the theoretical basis of the SPQs was evaluated in this review, it was not feasible to also evaluate the quality of the research underlying each questionnaire. Constructs also differed across SPQs in the way they were operationalized in items.

**Table 1** General information about the SPQs and the evaluation results (table not reproduced here). *Note* <sup>a</sup>PE = Primary education; SE = Secondary education. <sup>b</sup>Assess. purp. = SPQ assessment purpose: RP = Research practice; FL = Feedback loop; EA = Evidentiary argument; M = Measurement. <sup>c</sup>Theory = The theoretical basis of the instrument. <sup>d</sup>Material = Quality of the material. <sup>e</sup>Manual = Quality of the manual

**Table 2** Teaching quality constructs measured by the SPQs (table not reproduced here). *Note* <sup>a</sup>Climate = Safe and stimulating learning climate. <sup>b</sup>Management = Classroom management. <sup>c</sup>Motivation = Involvement/motivation of students. <sup>d</sup>Explanation = Explanation of subject matter. <sup>e</sup>Quality of representation = Quality of subject representation. <sup>f</sup>Cognitive activation = Cognitive activation of the students during the lesson. <sup>g</sup>AfL = Assessment for Learning. <sup>h</sup>Differentiation = Differentiated/adaptive instruction. <sup>i</sup>Learning Strategies = Teaching and learning strategies and student self-regulation

Standard two of the evaluation framework (the quality of the material) was met by 86% of the SPQs. A list of items (the material) was needed for the evaluation, and two SPQs lacked such a list (apart from items that were included in publications). Evaluation standard three considered the quality of the user guide. A user guide was available for five questionnaires, one of which was considered inadequate. Norms for the SPQ scores (standard four) were given for only two of the 22 SPQs. In one of these two cases (ZEBO), the norms were not satisfactory: sample sizes were small and the research conducted on the norms was weak.

Regarding the psychometric quality of the SPQs (standard 5), information about reliability was available and considered acceptable for all but one SPQ. A wide range of measures was reported, for example, percentages of exact agreement between students, Cohen's kappa coefficient, or Cronbach's alpha coefficient. The different measures of reliability are hard to compare because some imply absolute student score agreement while others imply student score consistency. For example, Fraser et al. (1993) reported internal consistency using Cronbach's alpha (.70–.95), while van Petegem et al. (2008) tested the split-half reliability of the results using the Spearman-Brown coefficient at the student level (*r*s .77–.90) and class level (*r*s .84–.95), and they presented the output of the Mokken scaling analyses as yet another measure of reliability. However, for most SPQs, an acceptable reliability coefficient was reported in at least one of the publications.
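For reference, the split-half approach steps the correlation between two test halves up to the reliability of the full-length test with the Spearman-Brown formula (a standard psychometric result, not specific to any reviewed SPQ):

$$r_{\text{full}} = \frac{2\, r_{\text{half}}}{1 + r_{\text{half}}}$$

For example, a half-test correlation of .70 implies a full-test reliability of 2(.70)/(1 + .70) ≈ .82.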

For all but two SPQs, information about construct validity (standard 6) was available, provided by exploratory or confirmatory factor analyses (EFA or CFA). Factor loadings on the teaching quality constructs the SPQs aimed to measure were acceptable in all SPQs. Five SPQs reported measures of criterion validity (standard 7). For example, Tripod Educational Partners (2014) demonstrated the criterion validity of the 7Cs framework by examining the correlation between the 7Cs and other commonly used measures of teaching effectiveness. Another example is Maulana et al. (2015), who examined predictive validity by evaluating the link between the scores on the questionnaire and student self-reports of academic engagement. The criterion validity was acceptable in these studies.

# **5 Conclusion, Discussion, and Next Steps**

Generating valid and reliable measurements by means of a student perception questionnaire (SPQ) is not self-evident. However, as an overview of the SPQs that have been developed and of their quality has been lacking, users have been hampered in making deliberate choices with respect to which SPQ to use in their own context (for examples, see Chap. 8 by Wisniewski and Zierer, in this volume). From a scientific perspective, it is also valuable to evaluate the extent to which SPQs meet the general standards for measurement instruments. Therefore, this review gives an overview of SPQs that are available for measuring teaching quality in primary and secondary education. It provides information about the (psychometric) quality of the SPQs and about what constructs they measure. After conducting a systematic review, 22 SPQs were evaluated based on seven evaluation standards. The evaluation was conducted by two reviewers. After reviewing all SPQs separately, the reviewers came to a final judgment.

# *5.1 What Was Learned*

The results of this study show that the reviewers were most positive about the theoretical basis of the SPQs and about the quality of the SPQ materials. For most SPQs, the reviewers were positive about the availability of empirical evidence regarding the reliability and validity of scores (except criterion validity). Moreover, the scoring tools in most SPQs were based on theory, research, and/or national standards. According to the reviewers, most SPQs also had acceptable reliability and construct validity. This could be partly explained by the well-known publication bias, in which only acceptable and confirmatory results get written about in order to increase the probability of publication (Falagas & Vangelis, 2008).

However, norm information about the quality rating measures was often lacking and few sampling specifications were provided. Information about the features of the SPQs, if available, was often not presented in an accessible way by instrument developers (e.g., in a user manual), making it difficult for potential SPQ users to obtain an overview of the qualities of available SPQs and to decide which SPQ best fits their own context and intended use. This raises questions about the factors that may have caused this.


# *5.2 Limitations of the Study*

It is important to note that the evaluation framework used in this study was based on the COTAN evaluation standards of Evers et al. (2010). These standards were developed for psychological tests in the social sciences, and it is possible that they are not the best way to evaluate SPQs, although the conclusions drawn from the results would likely have been similar in either case.

Although a systematic search was conducted for all available SPQs, questionnaires could be missing from the final overview. Search terms were chosen, inclusion and exclusion criteria were formulated, and different search strings were used. Twenty-two questionnaires were found and evaluated in this review. With slightly different search strings, more or less extensive criteria, or searches in different databases, other SPQs (e.g., in other languages) could have been found. This may also have to do with the timeframe during which the review was conducted: all information was gathered in 2016. Since multiple SPQs could still have been under development at that time, this review might not include all the information that is now available. However, there is no reason to believe that the conclusions of this review no longer apply.

# *5.3 Next Steps*

In line with Dobbelaer (2019), who reviewed classroom observation systems, the creation of an international database of all SPQs that have been developed, together with their materials and the research into them, could make SPQ developers more aware of the value of their SPQ for others. If SPQ materials are accessible to others, SPQs can also be researched and developed further (even if the instrument developers themselves lack the resources to do so), which could also help to reduce the need to constantly develop new SPQs.

Furthermore, a standardized evaluation framework with quality standards and research standards for SPQs, designed to evaluate the quality of SPQs and the empirical evidence for the reliability and validity of SPQ scores, could make instrument developers more aware of the complexity of SPQ development and of the research that is needed (for potential users). It could also make it easier for potential SPQ users to compare SPQs and the empirical evidence regarding them. Moreover, independent evaluations of SPQs using a standardized evaluation framework can provide potential users with the information they need to make a well-informed choice of an SPQ, to examine the aims of the SPQs, the operationalization and definitions of the constructs being measured, and the different ways to use them, and to make sure the desired information will be obtained.

# **Appendix: Search Terms**


(The table of search terms is not reproduced here.)

# **References**


an instrument to measure student-reported teaching behavior. *International Journal of Emotional Education, 5,* 5–30.


Wubbels, T., & Levy, J. (1993). *Do you know what you look like? Interpersonal relationships in education*. The Falmer Press.

**Hannah Bijlsma** is a researcher (Ph.D.) in the section of Teacher Professionalization at the University of Twente (The Netherlands) and works as a primary school teacher (grade 1). Her research focuses on measuring teaching quality and on the use of student perceptions of teaching quality in school contexts. In 2016 she founded a professional association for academic primary school teachers, of which she has been chair for about five years. She is now a board member of the International Congress for School Effectiveness and Improvement (ICSEI) and a board member of the EARLI SIG on School Effectiveness and Improvement.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 5 A Probabilistic Model for Feedback on Teachers' Instructional Effectiveness: Its Potential and the Challenge of Combining Multiple Perspectives**

#### **Rikkert van der Lans**

**Abstract** This chapter describes research into the validity of a teacher evaluation framework that was applied between 2012 and 2016 to provide feedback to Dutch secondary school teachers concerning their instructional effectiveness. In this research project, the acquisition of instructional effectiveness was conceptualized as unfolding along a continuum ranging from ineffective novice to effective expert instructor. Using advanced statistical models, teachers' current position on the continuum was estimated. This information was used to tailor feedback for professional development. Two instruments were applied to find teachers' current position on the continuum, namely the *International Comparative Assessment of Learning and Teaching* (ICALT) observation instrument and the *My Teacher–student questionnaire* (MTQ). This chapter highlights the background theory and central concepts behind the project and introduces the logic behind the statistical methods that were used to operationalize the continuum of instructional effectiveness. Specific attention is given to differences between students and observers in how they experience teachers' instructional effectiveness and the resulting disagreement in how they position teachers on the continuum. It is explained how this disagreement made feedback reports less actionable. The chapter then discusses evidence from two empirical studies that examined the disagreement from two methodological perspectives. Finally, it draws some tentative conclusions concerning the practical implications of the evidence.

**Keywords** Teacher evaluation · Teaching quality · Measures · Teacher development · Feedback

# **1 Introduction**

Tess is a school leader who must decide how to spend the school's resources on professional development. Today she has an assessment interview with John, whose lesson

R. van der Lans (B)

LUMC-Curium - Child and Adolescent Psychiatry, Leiden, The Netherlands e-mail: r.m.van\_der\_lans@lumc.nl

she visited just yesterday. Her rubric scorings signal some clear directions for improvement in the clarity of John's front-of-class explanations, yet the student questionnaire administered one month ago gives few signs of poor explanation skills. Instead, the student results signal that John could improve the interactivity of his instruction. Tess wants to use the interview to plan and guide John's further professionalization. However, how can she use the available information to provide John with actionable feedback that is likely to add to John's teaching? Also, Tess knows that John wants to discuss opportunities to participate in further training. Yet the evidence on John's instructional skill is inconclusive. Hence, on what grounds should she accept or refuse the request?

*An imaginary situation based on conversations with school leaders and teachers.*

Over the last decade, teacher evaluation has had a central position in policies aiming to improve educational quality in many countries (e.g., Doherty & Jacobs, 2013; Isoré, 2009; Nusche et al., 2014). In the Dutch context, the Ministry of Education published the "teacher agenda", which documented several challenges, objectives, and policy measures meant to increase the quality of the Dutch teacher workforce. One objective was to increase the frequency of performance evaluations in schools (Nusche et al., 2014; OECD, 2016). In the eyes of the policy-makers, performance evaluations were a means to turn schools into "learning organizations" by functioning as a yearly update and a reminder of teachers' and school leaders' commitment to increasing educational quality. To realize this objective, the councils for primary (PO-raad) and secondary (VO-raad) education and the teacher labor unions together agreed to install a new differentiated payment system and to assign every teacher a personal professionalization budget (Nusche et al., 2014; OECD, 2016). This created an incentive for teachers to request a performance evaluation interview to discuss evidence of instructional effectiveness. If the evidence of effectiveness was insufficient to qualify for a salary raise, the teacher was to be informed about the steps and/or skills required to qualify. Teachers could use their personal professionalization budget to train these skills. In practice, these policies meant that school leaders like Tess were confronted with the task of distinguishing between "average", "good", and "excellent" teachers, of giving teachers feedback about what they needed to learn, and of organizing the conditions under which teachers can start to learn.

The research project described in this chapter took place within this context and examined statistical methods and models that could assist school leaders to distinguish between teachers in terms of their level of instructional effectiveness. Furthermore, the new methods needed to result in feedback that clearly indicated a specific direction for improvement. Instruments used to collect data were student questionnaires and classroom observations. These two instruments were chosen because they share the strength that they collect direct observations of teachers' classroom behavior (Darling-Hammond, 2013; Goe et al., 2008; Peterson, 2000). However, as the situation of Tess and John shows, the feedback resulting from the student questionnaire and the feedback resulting from the classroom observation instrument do not always agree.

The chapter starts with an introduction of central concepts and how these were operationalized. The applied methods are introduced at a conceptual level and it is discussed how the models relate to other statistical methods that are commonly applied. After the background section, the chapter focuses on the problem of agreement between feedback sampled with student and classroom observation instruments.

# **2 Background Theory and Definitions of Central Concepts**

In the sketch at the beginning of this chapter, Tess is wondering how she may use student questionnaires and classroom observation instruments as two complementary instruments to provide John with actionable feedback. This highlights some central concepts of this chapter, namely instructional effectiveness, improvement, and actionable feedback.

# *2.1 Instructional Effectiveness*

In this chapter, instructional effectiveness is viewed as an estimate of the degree to which teachers' classroom behavior is expected to give students the opportunity to maximize their learning potential. By stating that instructional effectiveness provides students with opportunities to learn, the definition clarifies that it is associated with, but not identical to, student achievement and school success, which are realizations of these opportunities. Furthermore, the definition clarifies that instructional effectiveness is estimated, meaning that any claim about it is surrounded by some level of uncertainty. The research described in this chapter operationalized instructional effectiveness using two instruments, namely the International Comparative Assessment of Learning and Teaching (ICALT)—a classroom observation instrument—and the My Teacher questionnaire—a student questionnaire. Both instruments conceptualize an effective instructor as a teacher who scores high on six domains of instruction. The six domains are labeled "safe and stimulating learning climate", "efficiency of classroom management", "clear and structured explanations", "intensive and interactive instructions", "teaching students learning strategies", and "adaptation of instructions to individual student needs". Table 1 details the conceptualization of each domain. Several studies from different countries provide evidence suggesting that the items included in the ICALT and the My Teacher questionnaire cluster according to these six domains (André et al., 2020; Maulana & Helms-Lorenz, 2016; van de Grift et al., 2011).


**Table 1** An overview of the six domains and their conceptualization (table not reproduced here)

# *2.2 Improvement*

Another central term in this chapter is improvement, which implies that teachers can learn or be trained to become more effective instructors. Analogous to Berliner's (2004) novice-expert continuum, the research project and evidence discussed here conceptualize the improvement of instructional effectiveness as unfolding along a continuum ranging from completely ineffective to completely effective instruction. To illustrate how we operationalized this continuum, we start with a one-item example, "This teacher uses time efficiently". Research suggests that more effective instructors in general use time more efficiently than less effective instructors (Muijs et al., 2014). The *x*-axis in Fig. 1 visualizes the continuum of effective instruction. The *y*-axis represents the probability of a positive item score. In line with the above statement, Fig. 1 predicts that highly effective instructors have a near 100% probability of positive scores and that ineffective instructors have a near 0% probability. Furthermore, only when teachers have acquired a certain level of instructional effectiveness are they predicted to start learning how to use time efficiently. This can be inferred from Fig. 1 by observing that the probability of a positive score starts to rise at a certain location on the continuum.

**Fig. 1** Increase in the probability of a positive response to the classroom observation item "This teacher uses time efficiently"

Key to the perspective taken in this chapter is that training of teachers' instructional effectiveness is optimal when it is focused on items that match the teacher's location on the continuum.
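The s-shaped curve in Fig. 1 is, in essence, an item characteristic curve. The sketch below illustrates the idea with a simple Rasch-type model in Python; the difficulty value is invented for illustration and is not an estimate from the project's data:

```python
import math

def p_positive(theta: float, difficulty: float) -> float:
    """Rasch-type probability of a positive item score, given a teacher's
    position (theta) on the continuum of instructional effectiveness."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Hypothetical difficulty location of "This teacher uses time efficiently".
DIFFICULTY = 0.5

for theta in (-3.0, 0.5, 3.0):  # low, matching, and high effectiveness
    print(f"theta = {theta:+.1f} -> P(positive) = "
          f"{p_positive(theta, DIFFICULTY):.2f}")
```

Note that the probability is exactly 0.5 where the teacher's position matches the item's location, which is why such items are natural candidates for "what to learn next".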

#### **2.2.1 Relation of the Relatively Novel Statistical Model Applied Here to Other Statistical Models**

The proposition that all items measuring instructional effectiveness are associated with a single continuum seems to conflict with research that groups items of instructional effectiveness according to dimensions of teaching quality which load on separate (statistical) factors (e.g., studies applying factor analysis). Figure 2 is used to discuss and visualize the relationship between the continuum discussed in this chapter and factor analysis results. Figure 2 again visualizes the continuum of instructional effectiveness, but now includes multiple items. Solid, dashed, and dotted lines indicate clusters of items that have high(er) inter-item correlations (i.e., that load on separate factors). The reader can imagine moving the icon directly below the figure along the continuum to derive this logic. For example, teachers positioned at the icon have an approximately 50% probability of positive scores on the dotted items, but a near 0% probability of positive scores on the dashed and solid ones. When the teacher moves up the continuum, the probability of positive scores on the dashed items increases first, while the probability on the dotted items remains high and on the solid items remains low. Thus, some teachers likely have high scores on the dotted items and low scores on all other items; other teachers likely have high scores on the dotted and dashed items but low scores on the solid items; and yet others likely have high scores on all items. However, it is unlikely that teachers score high on the solid items but low on the others. This scoring pattern, which in the factor analytic literature is referred to as the simplex pattern, is detected by factor analysis as a sign that the dotted items and dashed items are in distinct clusters (interested readers may consult Browne [1992] or Jöreskog [1978] for further details).

**Fig. 2** The continuum of instructional effectiveness in which indicators are grouped into three factors

It is acknowledged that the figure presents an oversimplification of the relationship between factor analysis and the continuum described in this chapter. For example, item slopes, i.e., the steepness with which the s-curved lines increase, are rarely exactly parallel. Such differences in item slope also impact the assignment of items to factors. However, Fig. 2 illustrates the basic rationale of how different factors can arise on a single continuum.
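A small simulation can make this concrete. Assuming a Rasch-type response process with item difficulties grouped into three clusters (all values invented for illustration), items within a cluster correlate more strongly with each other than with items in other clusters, which is the kind of pattern a factor analysis picks up:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = rng.normal(size=2000)  # teacher positions on the continuum

# Three clusters of item difficulties along the continuum (illustrative).
difficulties = np.array([-1.5, -1.2, -0.9,   # "dotted" cluster (easy)
                          0.0,  0.3,  0.6,   # "dashed" cluster (middle)
                          1.5,  1.8,  2.1])  # "solid" cluster (hard)

# Simulate dichotomous scores under a Rasch-type model.
prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - difficulties[None, :])))
scores = (rng.random(prob.shape) < prob).astype(float)

corr = np.corrcoef(scores, rowvar=False)
print(f"within-cluster r  = {corr[0, 1]:.2f}")   # two easy items
print(f"between-cluster r = {corr[0, 6]:.2f}")   # an easy vs. a hard item
```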

#### **2.2.2 The Sequence of Clusters Along the Continuum**

The research group at the University of Groningen has given much empirical attention to the ordering of these factors along the continuum of instructional effectiveness (e.g., Maulana et al., 2015a, 2015b; van de Grift et al., 2011, 2014; van der Lans et al., 2015, 2018, 2021). The results indicated an ordering of the factors in the sequence in which they are presented in Table 1, thus: (1) safe and stimulating learning climate, (2) efficient classroom management, (3) clear and structured explanations, (4) intensive and interactive instructions, (5) teaching students learning strategies, and (6) adaptation of instructions to individual students' learning needs. The validity of this ordering was further corroborated by research in Cyprus that applied the same statistical models to its own questionnaire and observation instruments and reported broadly similar results (e.g., Kyriakides et al., 2009, 2018).

Based on these results, the author developed a feedback report providing teachers with the best estimate of their current position on the continuum of instructional effectiveness and the best estimate of what they could improve on next. Figure 3 presents two reports that were used by the author to give teachers feedback. The left report concerns the classroom observation instrument and the right concerns the student questionnaire instrument. The reports show a table with three columns. The column "level" lists the six identified levels (or domains) of instructional effectiveness. The column "item" lists the items included in the instruments. Finally, the column "teacher score" indicates what probably went well (darkest grey, top area), what can probably be learnt next (lightest grey, middle area), and what is probably still beyond the teacher's competency to learn (grey, lowest area). The asterisk indicates the exact teacher position on the continuum of instructional effectiveness.

# *2.3 Actionable Feedback*

The third central concept in this chapter is actionable feedback. Cannon and Witherspoon (2005) describe actionable feedback as feedback that leads to learning and increased performance. Evidence indicates that feedback is actionable when (1) it is directed at the task—not the person, (2) it is unambiguous and specific, and (3) it has clear implications for action (Cannon & Witherspoon, 2005; Kluger & DeNisi, 1996). The aim was to design feedback reports such that they would assist the feedback giver, presumably a school leader or coach, to be actionable. Therefore, the reports emphasize what the teacher does (e.g., the teacher involves students, explains clearly) and attempt to communicate as specifically as possible about what went well and what can be improved. In addition, the reports accompanied this information with specific implications for action. Participating teachers found this approach informative and often recognized themselves in the reports. Nonetheless, the actionability of the feedback was hindered by various organizational and psychometric factors. Organizationally, schools lacked an infrastructure to support training activities at this level of precision. Though this was not part of the research project discussed here, it is nonetheless important to mention. Psychometrically, the disagreement between students and observers created uncertainty about the reliability of the estimates. Take, for example, the two feedback reports in Fig. 3. The students on average positioned the teacher at the item "my teacher involves me in the lesson" and signal that the teacher should focus on improving the clarity and structure of explanations. The classroom observer, however, positioned the teacher at the item "encourages students to apply what they have learnt" and signals that the teacher should improve on teaching students learning strategies. Moreover, the observer's report suggests no problems with the clarity and structure of the teacher's explanations. Hence, in case of disagreement, the feedback reports no longer unambiguously communicate what went well and which domains of instruction need improvement. Also, the implications for action are no longer clear.


**Fig. 3** ICALT (left) and My Teacher (right) feedback reports showing the teaching skill score. Darkest grey (top area) = probably skilled; lightest grey (middle area) = try to learn now; grey (lowest area) = probably not yet skilled (*Note* These two feedback reports are slightly adapted versions of the ones that were originally reported to teachers)


# **3 Prior Research on the Disagreement Between Classroom Observation and Student Questionnaires**

Prior research suggests that the disagreement between students and observers may be frequent and/or substantial. Studies documenting correlations between observation and survey measures mostly report modest correlations in the range of 0.15–0.30 (e.g., De Jong & Westerhof, 2001; Ferguson & Danielson, 2014; Howard et al., 1985; Martínez et al., 2016; Maulana & Helms-Lorenz, 2016). Designs varied considerably between studies, however. For example, De Jong and Westerhof (2001) report on the correlation between a classroom observation instrument and a student questionnaire with considerably different factor structures, whereas Maulana and Helms-Lorenz report on the correlation between the ICALT and the My Teacher questionnaire, which have an overlapping factor structure. Because the study by Maulana and Helms-Lorenz (2016) applied the same instruments as the project discussed here, their results are most relevant to this chapter. They report a correlation of 0.26. This correlation was replicated in the data used in the research project reported on in this chapter. This modest correlation suggests that feedback reports of students and observers will more often disagree than agree. An exception to the above list of studies reporting modest correlations is the study by Murray (1983), who reports a correlation of 0.76. We will return to Murray's study later in this chapter.

# **4 Studying Evidence of Agreement and Disagreement Between Questionnaires and Classroom Observation Instruments**

Two perspectives can be taken to compare the feedback reports presented in Fig. 3. The first perspective focuses on the teacher's position and, as we have seen, this leads to the conclusion that the students and observers disagree. The alternative perspective focuses on the ordering of the items and domains on the continuum of instructional effectiveness. From this perspective, the students and observers mostly agree. Both the classroom observation and the student questionnaire feedback report start with items related to a safe and stimulating learning climate and end with items related to teaching students learning strategies and the adaptation of instruction to individual students' learning needs. The only two domains that are ordered differently by the two methods are the two final domains just mentioned.

Van der Lans et al. (2019) went one step further and showed that the My Teacher–student questionnaire and the ICALT classroom observation instrument items can be concurrently calibrated on the same continuum of instructional effectiveness. Table 2 lists the joint item ordering, mixing observation and questionnaire items. Items denoted with an "s" are student questionnaire items and items denoted with an "o" are classroom observation items.


**Table 2** Item ordering that resulted from the concurrent calibration of ICALT observation and My Teacher questionnaire items (table not reproduced here; originally published in van der Lans et al., 2019)

Studying Table 2 teaches us that similarly phrased questionnaire and observation items occasionally have similar positions on the continuum. Examples are S39 "my teacher involves me in the lesson" (student) and O11 "this teacher involves all students in the lesson" (classroom observation), and S17 "my teacher encourages me to think for myself" (student) and O19 "this teacher asks questions that encourage students to think" (classroom observation). However, the questionnaire and classroom observation instrument also contained several items that were instrument-unique but could nevertheless be calibrated on the same continuum. The considerable overlap between the questionnaire and observation instrument has clear practical implications. Suppose the two reports in Fig. 3 located a teacher on items related to the same domain—e.g., both on the domain of clear and structured explanation—then the observation and student feedback reports would give identical suggestions for improvement and, thus, would be more actionable. Conversely, there is no conceivable scenario in which the feedback reports are actionable while the ordering of teaching behaviors (items) on the continuum varies between the methods.
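As a rough sketch of what concurrent calibration involves (only a sketch: the actual study used proper IRT estimation, and the data and item locations below are simulated), items from both instruments scored on the same teachers can be placed on one scale by estimating each item's location from the shared sample:

```python
import numpy as np

rng = np.random.default_rng(4)
n_teachers = 500
theta = rng.normal(size=n_teachers)  # shared continuum of effectiveness

# Simulated item locations; "s" = student item, "o" = observation item.
items = {"s39_involves_me": -0.5, "o11_involves_all": -0.4,
         "s17_think_myself": 0.8, "o19_questions_think": 0.9}

# Simulate dichotomous scores (Rasch-type) for every teacher on every item.
scores = {name: (rng.random(n_teachers)
                 < 1 / (1 + np.exp(-(theta - b)))).astype(int)
          for name, b in items.items()}

# Crude common-sample calibration: estimate an item's location as the logit
# of its non-endorsement rate (possible here because all items share theta).
for name, x in sorted(scores.items(), key=lambda kv: kv[1].mean(),
                      reverse=True):
    p = x.mean()
    print(f"{name:20s} estimated location = {np.log((1 - p) / p):+.2f}")
```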

The combination of agreement in item ordering and disagreement in teacher location also has theoretical implications, because these findings do not fit well with most prior beliefs, theories, and hypotheses concerning the disagreement between questionnaire and observation instruments. For example, a long-standing tradition in educational psychology studies biases in instrument scores. Bias is generally examined by regressing the teacher scores on variables other than instructional effectiveness. Studies in the MET project, like Martínez et al. (2016), regressed the "teacher scores" on variables that were hypothesized to bias measurement. This resulted in a set of 'bias-corrected' teacher scores related to the student questionnaire and a set of 'bias-corrected' teacher scores related to the classroom observation instrument. These corrected scores were then correlated. However, the resulting correlations were similar to the correlations reported in studies that do not correct for bias (cf. Maulana & Helms-Lorenz, 2016; Martínez et al., 2016). More generally, hypotheses reflecting the belief that inferences based on scores need to be corrected for bias do not fit well with the evidence discussed so far. If the scores obtained with the instruments were biased and not indicative of instructional effectiveness, then how could we explain the high similarity in the item ordering along the continuum?

The evidence is also difficult to align with another prominent hypothesis, namely the perspective-specific validity hypothesis stated by Kunter and Baumert (2006). Kunter and Baumert proposed that, despite disagreement, scores obtained with distinct instruments can be used to make valid inferences about teachers' instructional effectiveness, given that the instruments are well designed and administered. Kunter and Baumert do not clearly define what they mean by "perspectives", which makes it complex to empirically assess their hypothesis (see also Fauth et al., 2020). However, many seem to understand different perspectives as meaning that some instruments might be more sensitive in tapping certain aspects of instructional effectiveness. This difference in sensitivity would explain the modest correlation. Also, using multiple instruments could help to offset blind spots, thereby allowing for a fuller and richer picture of instructional effectiveness. The current evidence is insufficient to completely verify this idea, but an analysis of the unique items in Table 2 provides surprisingly limited support for it. Take, for example, the item S3 "my teacher makes clear what I need to study for a test". Although this item mentions unique content ("what students need to study for a test" is not part of the ICALT observation list because observers usually cannot know this), item S3 does not have a unique position on the continuum. We could leave out item S3 without losing much information about teachers' instructional effectiveness. As another example, take item S34 "my teacher checks whether I understood the subject matter". The phrasing "whether *I* understood" focuses on the individual student, and such a focus is not included in any of the classroom observation items. Nonetheless, item S34 is located very close to item O12 "this teacher checks during instruction whether students have understood the subject matter", which measures the same content but has observers focus on the "average" student in the class. In sum, the evidence provides no strong indication that differences between instruments in terms of item content and item focus can explain the disagreement in how students and observers position teachers on the continuum of teaching effectiveness.

The disagreement in teachers' position on the continuum was examined in another study (van der Lans, 2018). Central to that study was the hypothesis that disagreement in teachers' position on the continuum changes as a function of measurement reliability. The study empirically assessed the correctness of two claims. First, it was claimed that the scores assigned by observers reflect the average student in the class. Therefore, agreement was expected to increase when classroom observation scores were correlated with the class-average questionnaire score, instead of with scores assigned by a single student. Second, it was claimed that student responses to questionnaire items reflect the teacher's typical teaching across many lessons. Therefore, agreement was expected to increase when classroom observation scores sampled from different lessons were averaged. The study applied generalizability theory to test these predictions and found support for both. The predicted correlation is lowest when the questionnaire scores of a single student are correlated with one classroom observation score from one single lesson; the more student questionnaires are sampled, the higher the predicted correlation with the observation score of a single lesson becomes. Also, the predicted correlation increases when the observation scores are aggregated over multiple lessons and, again, the more lessons are sampled, the higher the predicted correlation. The study results suggest that the correlation between questionnaire and classroom observation instruments increases to 0.76 when the classroom observation scores concern an aggregate of seven different lesson visits performed by seven different observers and the student questionnaire is administered in the same class and spans the scores of 25 different students. This correlation of 0.76 was interesting because of its correspondence to the correlation reported by Murray (1983), which was also 0.76. Murray estimated that correlation based on the aggregate classroom observation score of six to eight lesson visits by three different observers and a student questionnaire administered in the same class, with scores aggregated over all students in the class. The increase in the expected correlation between the questionnaire and classroom observation instrument is graphically presented in Fig. 4. The *y*-axis in Fig. 4 gives the predicted correlation. The *x*-axis indicates the number of students in the class. The separate lines indicate how predictions differ when the number of classroom observers and lesson moments sampled with the classroom observation instrument varies.

**Fig. 4** Predicted increase in correlations (ρ) between the MTQ student questionnaire and ICALT classroom observation instrument for an increasing number of administered classroom observations. The correlations apply when questionnaires and classroom observations are performed within the same class and span no more than one school year. Van der Lans (2018) reports predictions related to other situations (e.g., questionnaires and observations spanning different classes)

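A classical-test-theory reading of this prediction can be sketched as follows: aggregation raises each measure's reliability according to the Spearman-Brown formula, and the observed correlation between the two aggregated measures rises with the square root of the product of their reliabilities. The code below reproduces that logic with illustrative parameter values, which are not the variance components estimated by van der Lans (2018):

```python
def spearman_brown(rel_single: float, n: int) -> float:
    """Reliability of the average of n parallel measurements."""
    return n * rel_single / (1 + (n - 1) * rel_single)

# Made-up single-measurement reliabilities and a made-up true correlation.
REL_ONE_STUDENT, REL_ONE_LESSON, RHO_TRUE = 0.30, 0.55, 0.80

for n_students, n_lessons in [(1, 1), (25, 1), (1, 7), (25, 7)]:
    rel_s = spearman_brown(REL_ONE_STUDENT, n_students)
    rel_o = spearman_brown(REL_ONE_LESSON, n_lessons)
    # Observed correlations are attenuated by the unreliability of both sides.
    rho_obs = RHO_TRUE * (rel_s * rel_o) ** 0.5
    print(f"{n_students:2d} students, {n_lessons} lesson(s) "
          f"-> predicted rho = {rho_obs:.2f}")
```

With these illustrative numbers, the predicted correlation climbs from roughly 0.3 for a single student and a single lesson toward the 0.7 range for 25 students and seven lessons, mirroring the shape of the curves in Fig. 4.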

The implications of the results in Fig. 4 are not yet well understood. There are various possible interpretations. One interpretation is that more valid inferences are made with the questionnaire and classroom observation instruments when scores are aggregated over many students and over many lessons and observers, respectively. This interpretation aligns well with studies suggesting that single-student questionnaire scores are unreliable (Marsh, 2007) and that classroom observations of one single lesson are unreliable snapshots (Hill et al., 2012; Praetorius et al., 2014; van der Lans et al., 2016). This interpretation has considerable implications for the number of classroom observations and student questionnaires that need to be administered at schools. However, it does not align well with the rationale behind the hypotheses of van der Lans (2018). That is, the only reason why the correlation between the classroom observation instrument and the student questionnaire is predicted to increase as a function of the number of lesson visits included in the aggregate is that students are also expected to aggregate their experiences across many lessons when scoring the questionnaire. If questionnaires were able to tap students' experiences of one particular lesson, then the correlation would be predicted to be highest when the questionnaire results are correlated with classroom observations of that same lesson. The study was unable to examine this empirically, however. Similarly, the finding that questionnaire scores obtained from single students correlate weakly with the classroom observation scores is, in the study by van der Lans (2018), explained by the assumption that classroom observers score instructional effectiveness with respect to the "average" student. If observers had been instructed to score the classroom observation items in relation to one particular student, the correlation would be predicted to be highest for that particular student-observer dyad (compared to other dyads). Again, the study was unable to examine this claim empirically.

# **5 Discussion and Conclusion**

# *5.1 Potential Implications for Teacher Evaluation in Schools*

Based on the above discussions, what can we advise Tess and John? The evidence so far suggests that a single classroom visit will rarely show agreement with a single administration of the student questionnaire. The evidence also indicates that this has little to do with how students and observers interpret item content: the item ordering estimated across all students is very similar to the item ordering estimated across all observers. This does not imply that the quality of the item phrasing is unimportant, however. The evidence indicates that when items are well formulated, students score items related to the same domains of instructional effectiveness very similarly. We might advise Tess to postpone the performance evaluation interview and administer some additional classroom observations; the evidence presented in this chapter indicates that this increases the chance of agreement. However, it is not always possible to schedule additional classroom observations. Alternatively, we might advise Tess and John to focus on one result. Perhaps John wants to improve his instructional effectiveness when teaching certain subject matter, and because the classroom observation visit took place while John was teaching this particular subject matter, the classroom observation results are favored over the student questionnaire results. While this last advice might be intuitive to some, we must acknowledge that it rests on untested claims and hypotheses.

# *5.2 What to Do Next?*

One direction for future research concerns the construction of student questionnaires that can help us make valid inferences about the instructional effectiveness of single lessons. The Impact! tool is a potential example of such an instrument (Bijlsma et al., 2019). The alternative item phrasing of the Impact! questionnaire provides opportunities to test the hypothesis that correlations between student questionnaires and classroom observation instruments are attenuated because students aggregate their experiences over many lessons when scoring regular questionnaire items.

Another direction for future research concerns the commonly shared understanding among researchers that some instruments are more sensitive to certain aspects or behaviors of instructional effectiveness and that using multiple instruments could help to offset blind spots. If instruments have blind spots in their measurement of instructional effectiveness, then we would expect "gaps" in the continuum of instructional effectiveness. Such "gaps" can only become visible when multiple instruments are concurrently calibrated to the continuum. The resulting ordering of item positions may reveal that items of one instrument occupy unique locations on the continuum. The current evidence assessing this idea is very limited. Hopefully, the counterintuitive result that there is little evidence supporting the idea motivates future researchers to improve on the study designs, the content of the instruments, and the psychometric methods applied by van der Lans et al. (2019) in order to study this idea more thoroughly.

# **References**

André, S., Maulana, R., Helms-Lorenz, M., Telli, S., Chun, S., Fernández-García, C. M., et al. (2020). Student perceptions in measuring teaching behavior across six countries: A multi-group confirmatory factor analysis approach to measurement invariance. *Frontiers in Psychology, 11,* 273.


**Rikkert van der Lans** is a Postdoctoral Researcher currently working at the Department of Child and Adolescent Psychiatry, LUMC-Curium Leiden (The Netherlands). Previously he worked as a Postdoctoral Researcher at the Department of Teacher Education of the University of Groningen, as a lecturer in methods and statistics at the Department of Educational Sciences (GION) of the University of Groningen, and as a lecturer in the field of psychometrics at the Department of Methods and Statistics of the University of Tilburg (The Netherlands). His research focuses on the evaluation of professional development and the use of questionnaire and observation data to inform decisions.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 6 Understanding (Dis)Agreement in Student Ratings of Teaching and the Quality of the Learning Environment**

# **Jonathan D. Schweig and José Felipe Martínez**

**Abstract** Student surveys are increasingly being used to collect information about important aspects of learning environments. Research shows that aggregate indicators from these surveys (e.g., school or classroom averages) are reliable and correlate with important climate indicators and with student outcomes. However, we know less about whether within-classroom or within-school variation in student survey responses may contain additional information about the learning environment beyond that conveyed by average indicators. This question is important in light of mounting evidence that the educational experiences of different students and student groups can vary, even within the same school or classroom, in terms of opportunities for participation, teacher expectations, or the quantity and quality of teacher–student interactions, among others. In this chapter, we offer an overview of literature from different fields examining consensus for constructing average indicators, and consider it alongside the key assumptions and consequences of measurement models and analytic methods commonly used to summarize student survey reports of instruction and learning environments. We also consider recent empirical evidence that variation in student survey responses within classrooms can reflect systematically different experiences related to features of the school or classroom, instructional practices, student background, or a combination of these, and that these differences can predict variation in important academic and social-emotional outcomes. In the final section, we discuss the implications for evaluation, policy, equity, and instructional improvement.

**Keywords** Student ratings · Learning environment · Teaching quality

J. D. Schweig (B)
RAND Corporation, Arlington, VA, USA
e-mail: Jonathan\_Schweig@rand.org

J. F. Martínez
University of California, Los Angeles, CA, USA
e-mail: jfmtz@ucla.edu

# **1 Introduction**

Educators are increasingly turning to student surveys as a valuable source of information about important features of school and classroom learning environments, ranging from time on task and content coverage to more qualitative aspects of teaching, e.g., the extent to which classes are well managed, teachers foster student cognitive engagement, or students feel emotionally, physically, and intellectually safe (Baumert et al., 2010; Klieme et al., 2009; Pianta & Hamre, 2009). Considerable research shows that student survey reports can be aggregated into reliable indicators of constructs that have been variously identified in the literature with terms like *learning environment*, *classroom climate*, *instructional practice*, or *teaching quality*. These constructs may or may not be exchangeable across areas of study, but irrespective of terminology, the literature shows that student survey aggregates tend to correlate significantly with each other, with indicators derived through other methods (e.g., classroom observation), and with a range of desirable student outcomes. However, there is a gap in research investigating whether within-classroom or within-school variability in student survey responses may offer additional information beyond that conveyed by average indicators. This question is important in light of emerging evidence that the educational experiences of individual students can vary considerably within schools, and even within the same classroom, including opportunities for student participation (Reinholz & Shah, 2018; Schweig et al., 2020) and the quantity and quality of teacher–student interactions (e.g., Connor et al., 2009), among others. In this chapter, we first review literature that examines aggregate survey indicators in different fields and consider the key assumptions and consequences of measurement models and analytic methods commonly used to summarize student survey reports of teaching. We then examine the growing literature on variability in student survey responses within classrooms and schools, and on whether this variation may relate to educational experiences and outcomes. We illustrate the potential implications of this kind of variation using a hypothetical example case. In the final section, we discuss the implications of this research for evaluation, policy, and instructional improvement.

# **2 Student Surveys, Teaching, and the Learning Environment**

There are many reasons why educators are increasingly interested in student surveys as a source of information about learning environments. Perhaps most importantly, students can spend over 1,000 hours in their schools every year, and thus have unmatched depth and breadth of experience interacting with teachers and peers (Ferguson, 2012; Follman, 1992; Fraser, 2002). Students also provide a unique perspective compared to other reporters (Downer et al., 2015; Feldlaufer et al., 1988), and probing students about their perceptions of teaching and the learning environment acknowledges their voice (Bijlsma et al., 2019; Lincoln, 1995) and the significance of their school-based experiences (Fraser, 2002; Mitra, 2007). Second, a growing body of research suggests that students can provide trustworthy information about important aspects of the learning environment (Marsh, 2007). For example, survey-based aggregate indicators can reliably distinguish among instructional practices (Fauth et al., 2014; Kyriakides, 2005; Wagner et al., 2013) and aspects of teaching quality (e.g., Benton & Cashin, 2012). These aggregates are furthermore significantly and positively associated with other measures of teaching quality (e.g., Burniske & Meibaum, 2012; Kane & Staiger, 2012).

Like other measures, student survey responses can be susceptible to error (e.g., recall, inconsistency in interpretation; see e.g., Popham, 2013; van der Lans et al., 2015), bias (e.g., acquiescence), and halo effects (perceptions of one aspect of teaching influencing those of other aspects; see e.g., Fauth et al., 2014; Chap. 3 by Röhl and Rollett of this volume) that may influence their psychometric properties (see for example, Follman, 1992; Schweig, 2014; Wallace et al., 2016). Nevertheless, most existing studies suggest that these biases are generally small in magnitude and do not greatly influence comparisons across teachers or student groups, or how aggregates relate with one another and with external variables (Kane & Staiger, 2012; Vriesema & Gehlbach, 2019). Research also demonstrates that aggregated student survey responses are associated with important student outcomes including academic achievement (Durlak et al., 2011; Shindler et al., 2016), engagement (Christle et al., 2007), and self-efficacy and confidence (e.g., Fraser & McRobbie, 1995).

Student surveys also have the benefit of being cost-effective, relatively easy to administer, and feasible to use at scale (e.g., Balch, 2012; West et al., 2018). This is a particular advantage when contrasted with other commonly used methods for measuring teaching and the quality of the learning environment, including direct classroom observation. In large school districts, an observation system closely tied to professional development can require dozens of full-time positions, with yearly costs in the millions of dollars (Balch, 2012; Rothstein & Mathis, 2013). As a result, the use of student surveys has seen remarkable growth over the last two decades for evaluating educational interventions (Augustine et al., 2016; Gottfredson et al., 2005; Teh & Fraser, 1994), and monitoring and assessing educational programs and practices (Hamilton et al., 2019). In particular, student surveys are commonly used to inform teacher evaluation and accountability systems—summatively as input for setting actionable targets (Burniske & Meibaum, 2012; Little et al., 2009), or formatively to provide feedback and promote teacher reflection and instructional improvement (Bijlsma et al., 2019; Gehlbach et al., 2016; Wubbels & Brekelmans, 2005).

# **3 Psychological Climate, Organizational Climate, and Student Surveys**

In most contexts, schooling is an inherently social activity, and students typically experience schooling in organizational clusters (Bardach et al., 2019). The common pattern of student clustering within classrooms and schools presents challenges and choices in using surveys to understand teaching and the quality of the learning environment. One of the first choices is whether to focus the survey on understanding the personal perceptions and experiences of individual class members, or more broadly on shared elements of teaching quality relevant to the class or school as a whole (Bliese & Halverson, 1998; Den Brok et al., 2006; Echterhoff et al., 2009).

Surveys that aim to capture individual student interpretations of teaching quality or of the learning environment are described as reflecting *psychological climate*, and include items that ask for individual self-perceptions and personal beliefs (Glick, 1985; Maehr & Midgley, 1991). A long history of educational research suggests that psychological climate is a key proximal determinant of academic beliefs, behaviors, and emotions (Maehr & Midgley, 1991; Ryan & Grolnick, 1986). Because psychological climate variables treat individual perceptions as interpretable, it is appropriate to analyze them at the individual level (Stapleton et al., 2016), and differences among individual respondents are considered substantively meaningful. Individuals can react in different ways to the same practices: procedures that seem fair to one individual might seem unfair to another, and so forth. Psychological climate variables can also be aggregated to describe the composition of an organization (Sirotnik, 1980).

On the other hand, surveys that focus on the classroom or the school as a whole are described as reflecting *organizational climate* (see e.g., Lüdtke et al., 2009; Marsh et al., 2012), a concept that has a rich history in industrial and social psychology (Bliese & Halverson, 1998; Chan, 1998). Unlike psychological climate, organizational climate emerges from the collective perceptions of individuals as they experience policies, practices, and procedures (e.g., Hoy, 1990; Ostroff et al., 2003). Aggregating individual perceptions produces measures of organizational level phenomena (Sirotnik, 1980). These new variables can be interpreted to reflect an *overall* or shared perception of the environment (Lüdtke et al., 2009). The concept of organizational climate informs the design and use of many student surveys, which are typically directed toward students as a group, often asking for observations of the behavior of others (e.g., classmates, teachers; see Den Brok et al., 2006).

When survey responses are conceived as measures of organizational climate, aggregating them essentially positions students as informants or judges of a classroom- or school-level trait, similar to observers who provide ratings using a standardized protocol. To illustrate this assumption, consider the following claims in Table 1 regarding three widely used student surveys.

Thus, while psychological climate variables treat interindividual differences as substantively interpretable, organizational climate variables emerge based on shared student experiences, and assume that students have similar mental images of their classroom or school (Fraser, 1998).


**Table 1** Measurement claims for three widely used surveys

Students in a particular classroom or school are treated as *exchangeable* (Lüdtke et al., 2009), and interindividual differences are treated as idiosyncratic measurement error. Lüdtke et al. (2006, p. 207) noted that in the ideal scenario, "each student would assign the same rating, such that the responses of students in the same class would be interchangeable." Because organizational climate variables treat individual perceptions as error, it is appropriate to analyze them at the classroom or school level (Stapleton et al., 2016).

However, while the distinction between psychological and organizational variables is frequently drawn in the theoretical and methodological literature, much applied literature does not explicitly or consistently treat student survey-based ratings of teaching quality and the learning environment as either psychological or organizational measures (Lam et al., 2015; Schweig, 2014; Sirotnik, 1980). This in part reflects the fact that most student surveys occupy a gray area between the two classifications. On the one hand, classrooms and schools are shared spaces: students interact socially and build relationships with their peers and teachers, and some aspects of teaching quality apply more or less equally to all students in the classroom (Lam et al., 2015; Urdan & Schoenfelder, 2006). At the same time, students' school-based experiences can and often do differ, making their responses non-exchangeable; students are not objective, external observers, but active participants involved in complex interactions with other students, teachers, and features of the classroom and school environment. Teachers often interact with students through multiple modes and formats, both individually and as a group (whole-class instruction, group work); and of course students interact directly with one another, individually and as a group (Den Brok et al., 2006; Glick, 1985; Sirotnik, 1980).

# **4 Reporting Survey Results: Common Practices and Opportunities for Improvement**

In the previous section, we argued that researchers often do not explicitly state the measurement assumptions that underlie their use of student surveys. In particular, researchers are not always explicit about the unit of interest (e.g., the individual or the group) and about what this implies for the interpretability of individual student responses. These issues also arise in how survey developers choose to summarize and report survey results. In practice, nearly all survey platforms report measures of teaching and the quality of the learning environment by aggregating individual student responses to create classroom-level or school-level scores. It is these aggregates that are subsequently communicated to stakeholders or practitioners through data dashboards or survey reports (Bradshaw, 2017; Panorama Education, 2015). These aggregates can reflect simple averages (Balch, 2012; Bijlsma et al., 2019), percentages of respondents who report a certain experience or behavior (Panorama Education, 2015), or more sophisticated statistical models (e.g., IRT or other latent variable models; see e.g., Maulana et al., 2014), as sketched below.
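
The following is a minimal sketch of the two simplest aggregate types named above, computed with pandas on hypothetical data; the column names and the cut-off defining a "favorable" response are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical survey responses: one row per student, 5-point scale.
df = pd.DataFrame({
    "class_id": ["A"] * 5 + ["B"] * 5,
    "rating":   [3, 4, 5, 4, 3,  1, 5, 5, 3, 5],
})

report = df.groupby("class_id")["rating"].agg(
    class_average="mean",                      # simple average aggregate
    pct_favorable=lambda r: (r >= 4).mean(),   # share of responses rated 4 or 5
    within_class_sd="std",                     # spread, usually left unreported
)
print(report)
```

In this toy example, classes A and B have identical averages (3.8) and identical percent-favorable values (0.6) but very different spreads; most dashboards report only the first two columns.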

Irrespective of whether the survey developers are interested in individual or school- or classroom-level variables, this approach to score reporting often does not include information about the variability of student responses within classrooms or schools (Chan, 1998; Lüdtke et al., 2006). Thus, whether by accident or design, survey reports are ultimately firmly rooted in the notion of organizational climate in industrial or organizational psychology described previously: the shared learning environment is the central substantive focus, students are assumed to react similarly to similar external stimuli, and individual variation is assumed idiosyncratic or reflective of random measurement error (Chan, 1998; Lüdtke et al., 2009; Marsh et al., 2012).

However, while aggregated scores are useful for characterizing the overall learning experiences of a typical student, a growing body of research shows that these experiences can in fact vary greatly within schools and classrooms. Croninger and Valli (2009), for example, found that the vast majority (more than 80 percent) of the variance in the quality of spoken teacher–student exchanges occurred among lessons delivered by the same teachers. Den Brok et al. (2006) found that the majority of variance in student survey reports reflects differences among students within the same classroom (between 60 and 80 percent of the total variance). Crucially, emerging research also suggests that disagreement among students in their reports of the learning environment does not reflect only error, and indeed can provide important additional insights into teaching and learning not captured by classroom or school aggregates. In a study of elementary school students, Griffith (2000) found that schools with higher levels of agreement in student and parent survey reports of order and discipline tended to have higher levels of student achievement and parent engagement. Recent work by Bardach and colleagues (2019) found that within-classroom consensus on student reports of classroom goal structures was positively associated with socio-emotional and academic outcomes.

# *4.1 An Example Case of Within-Classroom Variability*

Examining the distribution of student reports can open up possibilities for using information about the nature and extent of student disagreement for diagnostic and formative purposes, and for focusing professional development opportunities for teachers, among others. The three hypothetical classrooms in Fig. 1 illustrate how different within-classroom distributions can produce the same aggregate classroom climate rating (e.g., Lindell & Brandt, 2000; Lüdtke et al., 2006).

For the purposes of this example, students in each of these classrooms are asked about their perceptions of cognitive activation in the classroom, and the extent to which they are presented with questions that encourage them to think thoroughly and explain their thinking (Lipowsky et al., 2009). Figure 1 displays the ratings provided by twenty students in each of the three classrooms. All three classrooms have the same average score of 3.42 on a 5-point scale.

In Classroom 1 there is noticeable disagreement in student survey responses, and students provide responses across the entire allowable score range. In Classroom 2, there is also substantial variability in student responses, but student perceptions seem polarized: a large group of students feels very positively about the level of cognitive activation, while another large group feels very negatively.

**Fig. 1** Three hypothetical distributions of student climate ratings yielding the same average of 3.42

Finally, in Classroom 3, there is perfect agreement among all students; this is the hypothetical ideal classroom described by Lüdtke and colleagues (2006), in which all students experience classroom climate in the same way. These scenarios raise important questions for practice. In principle, it does not seem justifiable to give the three classrooms in Fig. 1 the same feedback and professional development recommendations for teachers, thereby ignoring the fact that the patterns of within-classroom variation are dramatically different. A more sensible approach would likely consider whether the within-classroom variability in student reports can be informative for diagnosing and improving teaching quality. It is not possible to determine from this raw quantitative display *why* students in these three classrooms perceived cognitive activation in different ways, but the distributions themselves already carry diagnostic information. In the remainder of this chapter, we summarize and discuss relevant literature for understanding these interindividual differences.
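
To illustrate, the following is a minimal sketch contrasting three hypothetical distributions that share one mean but differ in spread. The data are invented for illustration (they are not the values underlying Fig. 1), and the agreement index shown is the within-group agreement index rwg of James, Demaree, and Wolf (1984), computed against a uniform null distribution:

```python
import numpy as np

# Hypothetical 20-student classrooms on a 5-point scale (not the Fig. 1 data).
classrooms = {
    "spread":    [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 3, 4, 2, 5, 4, 3, 4, 5],
    "polarized": [2] * 10 + [5] * 10,
    "consensus": [3] * 10 + [4] * 10,
}

sigma_eu = (5 ** 2 - 1) / 12  # variance of a uniform null on a 5-point scale

for name, ratings in classrooms.items():
    x = np.array(ratings, dtype=float)
    var = x.var(ddof=1)                 # observed within-class variance
    rwg = max(0.0, 1 - var / sigma_eu)  # rwg, conventionally floored at 0
    print(f"{name:9s} mean={x.mean():.2f} sd={x.std(ddof=1):.2f} rwg={rwg:.2f}")
```

All three hypothetical classes report the same average (3.5 here), yet the agreement index separates consensus from polarization, which is exactly the information a mean-only report discards.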

# **5 School and Classroom Factors Associated with Variation in Student Perceptions of Teaching Quality**

Within classrooms or schools, interindividual differences in the perception of teaching or the learning environment can arise for many reasons. We begin this section by discussing the standard assumption invoked by common approaches to survey score reporting (that within-classroom or within-school variation reflects measurement error) and subsequently present four alternative interpretations that have support in the literature: (1) differential expectations and teacher treatment, (2) diversity of student needs and expectations, (3) diversity of student backgrounds, experiences, cultural values, and norms, and (4) teacher characteristics.

# *5.1 Measurement Error*

Interindividual variability in student perceptions can be assumed to involve some idiosyncratic component of measurement error, i.e., random fluctuations around the "true" score of a school or classroom related to memory, inconsistency, and unpredictable interactions among time, location, and personal factors. Individual students may also vary in their standards of comparison (Heine et al., 2002) or in the internal scales they use to calibrate their perceptions (Guion, 1973). This can create differences in student scores analogous to rater effects in studies of observational protocols: some students may be more lenient or severe than others. Thus, some differences among students are not substantively interpretable (Marsh et al., 2012; Stapleton et al., 2016). Moreover, to the extent that students are not systematically sorted into classrooms based on *stringency*, these differences are not expected to induce bias and are best treated as measurement error (West et al., 2018). If interindividual variability were entirely idiosyncratic and random, however, we would generally not expect within-classroom variation in student ratings to be associated with other measures of teaching quality or student outcomes. Yet a number of prior studies have demonstrated that individual perceptions of school or classroom climate can be associated with student achievement. Griffith (2000) and Schweig (2016) found that learning environments with more within-classroom disagreement about order, discipline, and the quality of classroom management had lower academic performance, even holding average ratings constant. Schenke et al. (2018) found that heterogeneity among students' perceptions of emotional support, autonomy support, and performance focus was negatively associated with mathematics achievement. Martínez (2012) found that individual perceptions of opportunity to learn (OTL) were predictive of reading achievement, even after controlling for class- and school-level OTL. Such findings strongly suggest that within-classroom variability in student reports is not entirely reflective of measurement error.
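
One way to quantify how much of the rating variance is shared at the classroom level versus residing within classrooms is the intraclass correlation ICC(1) from a one-way random-effects ANOVA. The sketch below computes it on simulated balanced data; the simulation parameters are assumptions chosen only to make the decomposition visible:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, k = 100, 20  # 100 classes, 20 students each (balanced design)

# Simulate ratings = grand mean + class effect + individual deviation.
class_effect = rng.normal(0, 0.5, size=(n_classes, 1))  # between-class sd 0.5
ratings = 3.5 + class_effect + rng.normal(0, 1.0, size=(n_classes, k))

class_means = ratings.mean(axis=1)
grand_mean = ratings.mean()

# Balanced one-way ANOVA mean squares.
ms_between = k * ((class_means - grand_mean) ** 2).sum() / (n_classes - 1)
ms_within = ((ratings - class_means[:, None]) ** 2).sum() / (n_classes * (k - 1))

icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC(1) ~= {icc1:.2f}")  # ~0.20 here: 20% of variance is between classes
```

Under the pure-measurement-error reading, everything not captured by ICC(1) is noise; the studies cited above suggest instead that part of that within-class remainder is systematic.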

# *5.2 Differential Expectations and Teacher Treatment*

Teacher expectations are a critical determinant of student learning (Muijs et al., 2014). Teachers may consciously or unconsciously hold differential expectations for subgroups of students, which may translate into different sets of rules, classroom environments, and pedagogical strategies (Babad, 1993; Brophy & Good, 1974), potentially leading to opportunity gaps (Flores, 2007). Research has shown that some teachers hold lower achievement expectations for students of color (Banks & Banks, 1995; Oakes, 1990). Teachers may also have lower achievement expectations for female students (Lazarides & Watt, 2015) and offer them less reinforcement and feedback (e.g., Simpson & Erickson, 1983). Teacher expectations may also differ based on perceptions of student ability. At higher grade levels, research has shown that prior academic achievement is the most significant influence on teacher expectations (Lockheed, 1976). More recent research suggests that learning tasks are often differentially assigned to students based on teacher beliefs about student ability. For example, "mathematically rich" instruction (tasks requiring reasoning and creativity, multiple concepts and methods, and application to novel contexts) is often reserved for students perceived to be *high*-*achieving*, while those perceived as lower achieving spend more time developing and practicing basic skills (Schweig et al., 2020; Stipek et al., 2001). Thus, within-classroom variability in student survey reports could point to suboptimal or inequitable participation opportunities and instructional experiences for students of different groups (Gamoran & Weinstein, 1998; Seidel, 2006), which may, in turn, result in achievement gaps (Voight et al., 2015).

# *5.3 Diversity of Student Needs and Expectations*

Student perceptions of teaching and the learning environment may reflect different student needs and expectations: learning experiences and instructional practices that are successful with some students may not be effective with others, and students' socio-emotional needs and expectations may also differ substantially within classrooms. Levy et al. (2003) suggest, for example, that students with lower self-esteem may have greater needs with respect to the establishment of a supportive climate. Lüdtke et al. (2006) suggest that higher- and lower-ability students may differ in their perceptions of certain aspects of instructional practice, including pacing or task difficulty. English learners (ELs) and students with disabilities tend to report their schools to be less safe and supportive than their peers do (Crosnoe, 2005; De Boer et al., 2013; Watkins & Melde, 2009). ELs face challenges with language comprehension, particularly with academic or mathematical language (Freeman & Crawford, 2008), and this may create differential perceptions of the clarity of classroom procedures. On the other hand, Hough and colleagues (2017) found that ELs had systematically more favorable perceptions of their teachers and classrooms than their peers on several aspects of climate. EL students may be more engaged, more challenged, and better behaved, which could influence their overall perception of the classroom (LeClair et al., 2009). It may also be that ELs are more proactive in seeking out additional support from teachers, or that teachers are particularly sensitive to the needs of ELs (LeClair et al., 2009).

Alternatively, teachers may use instructional strategies that are responsive to and supportive of students' diverse needs and expectations, potentially causing student perceptions of the quality of their learning experiences to be *more similar*. For example, teachers may use complex instruction structured to promote student engagement, support critical thinking, and connect content in meaningful ways to students' lives (Averill et al., 2009; Freeman & Crawford, 2008). Thus, to the extent that within-classroom agreement is associated with the use of instructional strategies responsive to students' diverse needs and expectations, there may be more equitable opportunities for all students. In a recent mixed-methods study of science classrooms, we found that classrooms with higher levels of student agreement tended to provide more collaborative learning opportunities for students, including more group work, and to have more structured systems for eliciting student participation (Schweig et al., n.d.).

# *5.4 Diversity of Student Backgrounds, Experiences, Cultural Values, and Norms*

Reports of teaching and the quality of the learning environment may reflect cultural or contextual factors that lead students to perceive the learning environment differently (Bankston & Zhou, 2002; West et al., 2018). There is also research suggesting that student perceptions of the learning environment may differ by grade level (West et al., 2018). In the United States, research has shown that Black and Hispanic/Latino students often report feeling less connected to their schools, feel less positively about their relationships with teachers and administrators, and feel less safe in some areas of the school (Lacoe, 2015; Voight et al., 2015). However, recent literature suggests that this may not always be the case. Hough and colleagues (2017) found that while Black students gave systematically lower ratings of school connectedness, discipline, and safety than their peers, Hispanic/Latino students tended to report systematically higher perceptions. These findings are not inherently at odds, and other literature suggests that perceptions of the learning environment can differ even from one area of the school to another. Using data from New York City, Lacoe (2015) found, for example, that Black students have systematically lower perceptions of safety than their white peers in classrooms, but systematically higher perceptions of safety in hallways, bathrooms, and locker rooms. In our own work in mathematics and science classrooms, we found that classes with higher proportions of ELs and low-achieving students tended to show more within-classroom disagreement about teaching and the quality of the learning environment (Schweig, 2016; Schweig et al., 2017), and we also found significant within-classroom gaps between Black and white students on several aspects of teaching and the quality of the learning environment, with Black students typically holding more positive perceptions than their white peers (Perera & Schweig, 2019).

The perception of some teacher behaviors, including the extent to which teachers make students feel cared for, may depend strongly on cultural conceptions of caring (Garza, 2009). Calarco (2011) highlighted several ways in which economically disadvantaged students' help-seeking behaviors differed from those of their classmates in ways that could impact perceptions of teaching quality. Specifically, Calarco found that economically disadvantaged students sought less teacher assistance and, as a result, received less guidance from their teachers. Atlay and colleagues (2019) found that students from higher socioeconomic backgrounds were more critical of teacher assistance, perhaps reflecting a sense of entitlement (Lareau, 2002). Students' perceptions of teaching quality can also be influenced by out-of-school experiences; for example, there may be differential exposure to external stressors that influence feelings of school safety (Bankston & Zhou, 2002; Lareau & Horvat, 1999).

# *5.5 Teacher Characteristics*

A number of teacher characteristics can influence survey-based reports. Past work, for example, has shown that student perceptions of teachers are associated with teacher experience, and in particular, that more experienced teachers are perceived as more dominant and strict (Levy & Wubbels, 1992). More experienced teachers, however, are not generally perceived as more caring or supportive by their students (Den Brok et al., 2006; Levy et al., 2003). Teacher race and ethnicity can also play a role in survey-based ratings of teaching quality. Newly emerging research suggests that race-based disparities in perceptions of teaching quality can be ameliorated by the presence of teachers of color. Specifically, teacher–student race congruence may positively influence students' perceptions of teaching quality (Dee, 2005; Gershenson et al., 2016). In our own research, however, we did not find evidence that observable teacher characteristics, including teacher race, gender, years of experience, and level of education explain variation in race-based perceptual gaps (Perera & Schweig, 2019; Schweig, 2016).

# **6 Conclusion**

A growing body of evidence suggests that in considering instructional climate, researchers and school leaders may want to look beyond aggregate indicators and also consider the extent of variation (or consensus) in student survey reports as a potential indicator of important aspects of the school or classroom environment. In fact, the ability to capture within-school or within-classroom variability in student experiences is one of the defining strengths of student survey-based measures. Other commonly used measurement modes (including teacher self-report and structured classroom observation) are structurally not well equipped to capture differential student experiences. Classroom observation protocols, for example, are typically not designed to measure whether or how teachers engage with individual students (Cohen & Goldhaber, 2016; Douglas, 2009). Student surveys, on the other hand, offer information that goes beyond typical experiences and can allow teachers and instructional leaders to better understand how instruction, socio-emotional support, and other aspects of the learning environment are experienced by different students or groups of students.

Collectively, the research presented in this chapter suggests that variation in student survey reports of their learning environment may reflect a variety of factors and influences, including strategic instructional choices, responsive pedagogy, and classroom structures implemented by teachers; the varying needs and perceptions of particular students or groups of students; contextual factors; and the interactions among these. Importantly, variation can also reflect more pernicious influences such as differential teacher expectations and other structural disadvantages for some groups of students. Our example case also raises important questions about whether within-school or within-classroom variability should be treated as ignorable measurement error when examining student survey reports of teaching quality and learning environments. Should we give the three classrooms in Fig. 1 the same feedback and professional development recommendations for teachers? Or is there evidence in the within-classroom variability of student reports that can be informative for these purposes? Recent policy guidelines in the United States either explicitly require or implicitly encourage the latter, advising education agencies to provide schools not only with aggregated survey-based indicators, but also with indicators disaggregated by student subgroup (Holahan & Batey, 2019; Voight et al., 2015). A growing consensus also sees attending to these subgroup differences as key to the school-wide adoption of instructional improvement strategies that meet the learning needs of the most vulnerable students (Kostyo et al., 2018).

Considering the diversity of student perspectives and experiences can be particularly useful for informing efforts to promote equitable learning and outcomes. Ultimately, whether climate is conceived as psychological or organizational, or both, if subgroups of students experience school life in meaningfully different ways, then reliance on aggregated survey indicators as measures of teaching quality can obscure diagnostic information (Roberts et al., 1978) and compromise the validity and utility of these measures for informing teacher reflection, feedback, and other improvement processes within schools (Gehlbach, 2015; Lüdtke et al., 2006).

# **References**


Hough, H., Kalogrides, D., & Loeb, S. (2017). *Using surveys of students' social-emotional learning and school climate for accountability and continuous improvement*. Policy Analysis for California Education. https://edpolicyinca.org/sites/default/files/SEL-CC\_report.pdf. Accessed 5 August 2020.


**Jonathan D. Schweig** is a Social Scientist at the RAND Corporation (USA). His research focuses on the measurement of teaching quality and of features of the learning environment, the development and implementation of teacher evaluation policies, and student social and emotional learning. His recent publications focus on the implications that methodological choices can have for inferences about instructional practice, classroom climate, and program effectiveness. Jonathan earned an M.A. in Curriculum and Teacher Education from Stanford University, an M.S. in Statistics from the University of California, Los Angeles, and a Ph.D. in Education from the University of California, Los Angeles (USA).

**José Felipe Martínez** is an Associate Professor of Social Research Methodology at UCLA's School of Education and Information Studies (USA). His research examines the application of measurement theory and methods to issues in education policy, specifically in relation to teacher, school, and program evaluation. A particular focus comprises instruments and tools for measuring classroom climate and instructional practice in mathematics and science, including teacher portfolios, student surveys, and observation protocols. His work has been supported by the National Science Foundation, and the Spencer, WT Grant, and McDonnell foundations. Dr. Martinez teaches courses on measurement, research design, and survey methodology.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 7 Student Ratings of Teaching Quality Dimensions: Empirical Findings and Future Directions**

## **Richard Göllner, Benjamin Fauth, and Wolfgang Wagner**

**Abstract** This chapter discusses current issues in research on the validity of student ratings of teaching quality. We first discuss the advantages and limitations of student ratings of teaching quality based on theoretical considerations regarding the teaching quality concept. Research reveals that the validity of student ratings differs depending on the aspect of teaching quality being rated (i.e., classroom management, cognitive activation, or student support). Extending this research, we propose that future studies on the validity of student ratings should take into account students' cognitive processing while responding to survey items. We discuss three areas that seem promising for future research: the complexity and comprehensibility of survey items, the referent and addressee of items, and finally, the idiosyncratic nature of student ratings.

**Keywords** Student ratings · Teaching quality · Dimensions · Validity · Theoretical considerations

# **1 Introduction**

Assuring reliable and valid measures is a key issue in assessing teaching quality in schools or classrooms for evaluative purposes. In general, student ratings represent a promising way to evaluate teaching because they provide firsthand impressions and are more efficient in assessing teaching quality than alternatives such as classroom observations. On the other hand, however, scholars have expressed concerns about students' ability to provide reliable and valid information about teaching quality.

R. Göllner (B) · W. Wagner
Hector Research Institute of Education Sciences and Psychology, University of Tübingen, Tübingen, Germany
e-mail: richard.goellner@uni-tuebingen.de

W. Wagner
e-mail: wolfgang.wagner@uni-tuebingen.de

B. Fauth
Institute for Educational Analysis (IBBW), Stuttgart, Germany
e-mail: Benjamin.Fauth@ibbw.kv.bwl.de

In the following chapter, we first describe a common framework of teaching quality and then present recent findings on the differential validity of student ratings for conceptually different aspects of teaching quality. Finally, we show how attending to the way students are asked about teaching quality in surveys raises awareness of both the potential and the limitations of student ratings and can help us identify existing gaps in the field of teaching quality research.

# **2 The Concept of Teaching Quality**

Teaching quality is widely understood as rooted in a teacher's actual behavior, but it is also influenced by student–teacher interactions (Doyle, 2013; Fauth et al., 2020b; Göllner et al., 2020; Hamre & Pianta, 2010; Kunter et al., 2013). Conceptually, teaching quality thus refers to teacher behavior in the classroom as well as to students' reactions to this behavior, and vice versa. One implication is that the context and conditions in which teaching takes place always need to be considered. Teaching quality has been described and assessed in a number of different frameworks, many of which show a great deal of overlap (e.g., Creemers & Kyriakides, 2008; Danielson, 2007; Pianta et al., 2008). A very common conception of teaching quality subdivides it into three superordinate quality domains, namely classroom management, teachers' learning support, and cognitive activation (see Hamre & Pianta, 2010; Praetorius et al., 2018). Classroom management has traditionally been seen as a central element of good teaching and has an important place in many conceptualizations of teaching quality; relevant characteristics include the absence of student misbehavior and the effective management of time and classroom routines (Evertson & Weinstein, 2006). Student support is based on a positive student–teacher relationship and a learning environment in which, for example, students are given constructive feedback on how to improve their performance or are helped to see the relevance of the subject matter (Brophy, 2000). Finally, cognitive activation encompasses, for example, providing challenging tasks that clarify the connections between different concepts or link new learning content to prior knowledge (e.g., Kunter et al., 2013). These aspects of quality have received substantial empirical attention in recent years. Most importantly for the present chapter, they serve as the foundation for survey instruments and observation protocols that can be used to examine the empirical relevance of teaching quality for students' achievement and learning-related outcomes (e.g., students' interest, motivation, and self-efficacy; e.g., Kunter et al., 2013).

# **3 Why Should Student Ratings Be Used to Assess Teaching Quality?**

Teaching quality—in terms of teachers' classroom management, the support teachers provide to students, or the extent to which learning is cognitively demanding—can be assessed in different ways, each of which entails a number of advantages and disadvantages (Derry et al., 2010; Desimone et al., 2010; Fraser & Walberg, 1991; Wubbels et al., 1992). For instance, classroom observations are viewed as the gold standard in teaching quality research. They are considered the most objective method of measuring teaching practices and represent a central element in teacher training (Pianta et al., 2008). On the other hand, it is widely recognized that classroom observation is not without problems. Observers need to be specially trained, their observations provide only snapshots, and it is unclear whether the presence of observers systematically changes the behavior of teachers and students (e.g., Derry et al., 2010).

In contrast to classroom observations, student ratings of teaching quality are much easier to obtain. They are considered more cost-effective, and they are directly tied to students' day-to-day classroom experiences. Moreover, because they are not merely the result of a single or very limited number of observations, they can ensure a reliable assessment of teaching quality (Lüdtke et al., 2009). Research has shown that the psychometric properties of a class's average teaching quality perceptions are not systematically inferior to those of observational measures (e.g., Clausen, 2002; de Jong & Westerhof, 2001; Maulana & Helms-Lorenz, 2016). In addition, there is empirical evidence that students are able to provide valid ratings of teaching quality, although differences between quality dimensions need to be taken into account (Fauth et al., 2014; Kuhfeld, 2017; Nelson et al., 2014; Schweig, 2014; Wagner et al., 2013; Wallace et al., 2016; see also Chap. 5 by van der Lans in this volume).

Specifically, previous research has shown that student ratings of classroom management typically emerge as a clearly identifiable teaching quality aspect, which exhibits significant associations with observational as well as teacher self-report data and predicts students' learning in terms of their achievement, interest, and motivation (e.g., Kunter et al., 2007; Lipowsky et al., 2009). Furthermore, student ratings of classroom management are comparable across different learning contexts (e.g., different school subjects; Wagner et al., 2013) and even show time-specificity: student ratings have proven sensitive enough to capture differences in teachers' classroom management over the course of several weeks or months (Wagner et al., 2016).

In contrast, the psychometric properties of student ratings of learning support and cognitive activation are less clear. In the case of cognitive activation, this is because measures cannot be applied generically to all subjects but need to reflect the specificity and requirements of each individual subject (e.g., mathematics, languages, the arts, etc.). Consequently, the majority of existing student surveys of teaching quality do not include cognitive activation measures, making it much harder to evaluate the validity of student ratings with respect to this dimension. Nevertheless, the few studies that do exist show that even ratings by primary school students reveal substantial differences in cognitive activation between classrooms. In addition, cognitive activation ratings have been shown to be separable from classroom management ratings, and to a lesser extent from learning support ratings, and to show statistically significant associations with student learning outcomes (e.g., subject-related interest; Fauth et al., 2014).

The situation for student ratings of learning support is even more complex. Previous research has shown that student ratings of learning support exhibit relatively low agreement with classroom observations and even low agreement across students in the same classroom. One potential explanation is that students' perceptions of teachers' learning support do not exclusively function as a quality characteristic that differs across classrooms but are also affected by students' individual experiences within classrooms (Aldrup et al., 2018; Atlay et al., 2019; den Brok et al., 2006a; Göllner et al., 2018).
For a long time, these within-classroom differences were considered the result of factors external to teaching quality, such as students' rating tendencies (e.g., harshness or leniency) or perceptual mindsets (e.g., halo error; e.g., Lance et al., 1994). However, recent research has shown that these differences can also reflect effects stemming from the dyadic relationship between each individual student and his or her teacher. Specifically, a recent study by Göllner and colleagues (2018), using national longitudinal data from the Program for International Student Assessment (PISA) database, showed that differences in student perceptions of learning support partially result from teacher-independent rater tendencies, but also reflect the dyadic relationship between an individual student and one specific teacher. Students' ratings of teaching quality therefore provide important information about their individual experiences in their classroom learning environments.

# **4 Future Directions for the Use of Students' Ratings**

Although student ratings have become a prominent way to obtain student feedback on teaching quality in schools and classrooms, scholars and practitioners have also criticized their use in both summative and formative assessments (Abrami et al., 2007; Benton & Cashin, 2012). Critics emphasize the specific nature of student ratings, as students are not trained to provide valid assessments of teaching quality in the way adult observers are. It is thus important to acknowledge the potential limitations of student ratings, which raises the question of how student ratings for evaluative purposes can be improved. We believe that a more detailed examination of existing survey instruments can be a fruitful approach to finding out how student ratings work and what we can do to achieve reliable and valid ratings. From a very general perspective, a student survey can be seen as ordinary text material (i.e., textual information presented in the form of separate items) that requires students to read and interpret a question to understand what is meant, retrieve the requested information from memory, and form a judgment based on their knowledge and expertise (Tourangeau et al., 2000). Building on this foundation, this chapter presents three areas of recent research that might provide a deeper understanding of students' teaching quality ratings and open up future research directions.

# *4.1 Complexity and Comprehensibility*

At first glance, existing student surveys differ fundamentally in their linguistic complexity, which shapes student responses (e.g., Krosnick & Presser, 2010; Tourangeau et al., 2000). It is surprising to see that even frequently used surveys are linguistically challenging, particularly for younger respondents (e.g., Fauth et al., 2014; Wagner et al., 2013). Consequently, it can be argued that many reporting problems (e.g., low interrater agreement) arise because students encounter difficulties in comprehending the survey. Survey items include many linguistic features, including surface aspects (e.g., the length of words and sentences) and characteristics that require deeper linguistic analysis (e.g., the number of complex noun phrases). For example, the following items might be used to assess teachers' sensitivity to and awareness of students' level of academic functioning: "In math, the individual students often do different tasks" and "In math lessons, the teacher asks different questions, depending on how able the student is." The items differ in their linguistic characteristics: number of words (9 vs. 15), sentence structure (1 vs. 2 clauses), average word length (5.33 vs. 5.00 characters), and number of complex noun phrases per clause (2 vs. 0.5). In addition, students may be less familiar with certain words used in the items (e.g., "individual," "depending") or have to make many interpretations because single words do not refer to specific, denotable, and relatively objective behavior (i.e., high-inference ratings; e.g., Roch et al., 2009; Rosenshine, 1970).

Despite the large body of literature on traditional best practices in the construction of survey questions (see Krosnick & Presser, 2010), only a few studies have examined the impact of these and other linguistic characteristics on student surveys' ability to assess teaching quality reliably and validly. One of these studies showed that the use of measures with lower specificity and a higher level of abstraction (high-inference ratings) leads to higher interrater reliability in student ratings, but lower agreement with expert assessments. Contrary to common expectations, rater agreement increased as the behavioral observability of the measures decreased (Roch et al., 2009). The authors argue that raters might compensate for uncertainty in high-inference ratings by adjusting their ratings more strongly to match their general impression, which might in turn be unrelated or only partially related to the teaching quality dimension in question. Such findings demonstrate that the association between linguistic features and the psychometric properties of student ratings is anything but trivial, and that a more rigorous consideration of linguistic form in existing surveys is needed.
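
The surface features mentioned above can be computed directly. The following is a minimal sketch that reproduces the word counts and average word lengths for the two example items; counting clauses and complex noun phrases would require syntactic parsing and is left out here:

```python
import re

# The two example items from the text.
items = [
    "In math, the individual students often do different tasks",
    "In math lessons, the teacher asks different questions, "
    "depending on how able the student is.",
]

for item in items:
    words = re.findall(r"[A-Za-z]+", item)  # strip punctuation
    n_words = len(words)
    avg_len = sum(len(w) for w in words) / n_words
    print(f"words={n_words:2d} avg_word_length={avg_len:.2f}")
# -> words= 9 avg_word_length=5.33
# -> words=15 avg_word_length=5.00
```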

# *4.2 Framing*

Student surveys also differ in characteristics beyond linguistic complexity. Specifically, the referent and the addressee of survey items are two salient characteristics that might affect the information obtained from student ratings of teaching quality but have received less attention in research on student perceptions of teaching quality (den Brok et al., 2004, 2006b; McRobbie et al., 1998). The referent can be defined as the subject to which an item refers. At first glance, student rating items that refer more to the classroom (e.g., "In math class, the lesson is often disrupted") than to the teacher (e.g., "Our math teacher always knows exactly what is happening in class") tend to exhibit more favorable psychometric properties in terms of interrater agreement or distinctiveness from other theoretically relevant aspects of teaching quality (see Fauth et al., 2020a; Göllner et al., 2020). However, the use of surveys that refer more to the classroom than to the teacher might entail serious constraints. First, items referring to the classroom are frequently used to assess classroom management, but much more rarely to assess learning support or cognitive activation. This raises the question of whether the well-established distinctiveness of classroom management relative to other quality aspects is partly due to systematic differences in the referent used. Second, previous findings have shown that when classroom management items refer to the classroom, measures are more prone to classroom composition effects (e.g., the proportion of male students or the performance composition). Even though existing analytical procedures can be used to account for such differences in classroom composition, it is unclear whether such adjustments result in fair comparisons or relatively favor or penalize certain individual teachers. Irrespective of this, classroom management measures referring more to students than to the teacher need to be seen from an interactionist perspective that includes both teachers and the students they teach (Fauth et al., 2020a).

In addition, the addressee of the teacher's behavior in a survey item is important. In the simplest case, this can be either the responding student him- or herself (e.g., "The teacher motivates me") or all students in the classroom (e.g., "The teacher motivates us"). An examination of existing surveys shows that the "me-addressee" is predominantly used when assessing the support teachers provide to students, whereas the "we-addressee" is more frequently used for classroom management and cognitive activation (e.g., BIJU, Baumert et al., 1996; Tripod survey, e.g., Prenzel et al., 2013; Wallace et al., 2016). At the same time, previous studies have shown that student support dimensions usually fail to predict student learning outcomes at the classroom level but are more consistent predictors at the individual student level (e.g., Aldrup et al., 2018). These results raise the question of whether support is better conceptualized as a dyadic phenomenon between a teacher and an individual student, or whether the findings merely reflect how teacher support is assessed. Experimentally varying the addressee of items assessing multiple teaching quality dimensions would enable us to examine whether the addressee affects the information obtained from student ratings at the student and classroom levels. The findings might also be relevant for the analytical modeling procedures used in teaching quality research.
First, findings from multilevel models used to separate students' shared (classroom-level) and non-shared (student-level) perceptions of teaching quality might be directly affected by the addressee used. Whereas the "me-addressee" is assumed to provide valid information about students' individual learning experiences at the student level, the "we-addressee" might be more adequate for gaining insights into students' learning at the classroom level; in other words, one cannot simply assume that different item wordings can be used interchangeably at different levels of analysis (den Brok, 2001). Second, there is increased interest in more recent analytical procedures that model classroom heterogeneity in student ratings as an additional indicator of good teaching (e.g., Schenke et al., 2018). Applying these modeling procedures to surveys with a "me-addressee" might be a better way to assess student–teacher fit and teacher adaptivity in classrooms than using surveys with a "we-addressee." If surveys with a "we-addressee" are used, different levels of heterogeneity between classes might rather reflect class-specific measurement precision (i.e., more or less agreement within classes). In other words, the choice of addressee can be assumed to have serious consequences for teaching quality assessment and for the view we take of students' learning in classrooms.
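To make the notion of levels explicit, the standard two-level decomposition underlying such multilevel models (a generic textbook formulation, not a formula taken from the studies cited here) writes student $i$'s rating of classroom $j$ as

$$ y_{ij} = \gamma_0 + u_j + \varepsilon_{ij}, \qquad \rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}, $$

where $u_j$ captures the shared classroom-level perception, $\varepsilon_{ij}$ the individual student's deviation from it, and the intraclass correlation $\rho$ indicates how much of the rating variance is located at the classroom level. The argument above amounts to the claim that the addressee ("me" vs. "we") may shift how much valid information each of these two components carries.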

# *4.3 The Idiosyncratic Nature of Student Ratings*

Finally, it is important to ask what we can fundamentally expect from student ratings of teaching quality and to what extent student ratings reveal idiosyncrasies, i.e., differ systematically from alternative methods. Even when we take special care to use comprehensible and age-appropriate surveys and make more intentional decisions about the referent and addressee of survey items, the specific nature of student ratings needs to be considered. One main objective of previous research has been to determine the degree of idiosyncrasy in student ratings by comparing them to alternative assessment methods (e.g., Clausen, 2002; Kunter & Baumert, 2006). This research has shown that student ratings, particularly those assessing learning support and cognitive activation, differ substantially from classroom observations or teacher self-report data, which might lead to the conclusion that students are less able to provide valid information on teaching quality and its theoretically proposed dimensions (e.g., Abrami et al., 2007). However, this focus on the limitations and biases of student ratings bears the risk of neglecting the expertise students naturally acquire through their everyday experiences in classrooms. Thus, future research needs to better appreciate the unique information we obtain from student ratings (e.g., Leighton, 2019). In order to do so, however, we need to learn much more about the mental models that underlie students' ratings and the extent to which these models differ from those of adult observers evaluating teaching quality. A recent study by Jaekel et al. (2021) found that student ratings of teaching quality in one school subject (mathematics or German language) did not result solely from students' daily experiences in the subject at hand but were also affected by their experiences in the respective other subject. Students seem to make use of comparative information when objective criteria for good teaching are not available.

In addition, there is a need to understand how a developmental perspective can help us understand idiosyncrasies in student ratings of teaching quality. That is, it is reasonable to assume that student ratings of teaching quality are affected by the age-related developmental stage in which the ratings take place. For instance, students' need to define their own identity and their stronger need for autonomy during adolescence (e.g., Eccles et al., 1993) might function as a guiding perspective when students have to rate teaching quality. A recent study by Wallace and colleagues (2016) based on the Tripod survey identified two dimensions of students' ratings of teaching quality: one specific classroom management factor and one broad general factor. Interestingly, the quality indicators with the highest loadings on the general factor were indicators that clearly capture students' perceptions of teachers' learning support and the student–teacher relationship (Schweig, 2014; Wallace et al., 2016). The same is true for student ratings of cognitive activation. It is interesting to note that even though cognitive activation is considered a central aspect in explaining students' achievement, cognitive activation measures are much less common in existing surveys than classroom management or learning support measures.
One major reason for this is that assessing teachers' ability to use stimulating learning materials, the quality of the questions teachers ask during lessons, or the quality of classroom discussion from the students' perspective is seen as a particularly challenging task, because it requires special knowledge and skills that lie beyond students' firsthand experiences of participation in the classroom. Whether and to what extent students are really able to provide information on these and other aspects of cognitive activation in line with an adult view remains an open question that needs to be addressed in future research. As part of this process, we have to think about further refining existing measures so that they capture central aspects of cognitive activation in a wide variety of learning situations and make more explicit use of other principles for getting learners to learn long, complex, and difficult things. Alternative ways of conceptualizing and measuring effective learning contexts from related disciplines (e.g., discourse analysis in linguistic research; Turner & Meyer, 2000) or entirely different research fields (e.g., game-based learning; Gee, 2007) can provide a good foundation for improving existing cognitive activation measures.

# **5 Closing Remarks**

As the work we reviewed in this chapter makes clear, student ratings have become a vibrant part of teaching quality research. We are particularly excited about two aspects of this research. The first is the usefulness of student ratings in research and practice. Even though differences across teaching quality dimensions need to be considered, students can provide a valid perspective on teaching quality and are thus in no way generally inferior to alternative assessments such as classroom observations or teacher self-reports. Second, students provide a plethora of information on teaching quality at both the classroom and the student level, with the latter referring to students' individual learning experiences within a classroom in a way that is beyond the scope of alternative assessments. As research on student ratings progresses, it will be critical to take a deeper and more consequential look at the characteristics of existing surveys to determine what we can learn about teaching quality from the students' perspective. We look forward to participating in work on these topics in the future.


# **References**


**Richard Göllner** is a Professor of Educational Effectiveness and Trajectories at the Hector Research Institute of Education Sciences and Psychology at the University of Tübingen (Germany). His work focuses on teaching quality, specifically on the measurement of instructional practice from an interdisciplinary perspective, and its impact on students' achievement. Furthermore, he is interested in students' personality development within schools and the use of simulated learning contexts in experimental research in education.

**Benjamin Fauth** is Head of the Department for Empirical Educational Research at the Institute for Educational Analysis (IBBW) in Stuttgart (Germany) and Associate Professor at the Hector Research Institute of Education Sciences and Psychology at the University of Tübingen (Germany). His research focuses on the quality of teaching, in particular questions of the theoretical conceptualization, assessment, and the impact of teaching quality. Furthermore, his research focuses on the professional competence of teachers and on questions of applied evaluation research.

**Wolfgang Wagner** studied psychology at the University of Koblenz-Landau (Germany) and now works as a research assistant at the Hector Research Institute of Education Sciences and Psychology at the University of Tübingen (Germany). His main research interests include the assessment of characteristics of learning environments and their effects on the development of targeted outcomes (in particular, academic achievement), as well as methodological issues in the field of (multilevel) latent variable models.


# **Part II Using Student Feedback for the Development of Teaching and Teachers**

# **Chapter 8 Functions and Success Conditions of Student Feedback in the Development of Teaching and Teachers**

**Benedikt Wisniewski and Klaus Zierer**

**Abstract** The term "student feedback" is often used synonymously with evaluation, assessment, or ratings of teaching, but it can be conceptually delimited from these concepts by distinguishing formative and summative aspects. Obtaining feedback is a core component of teachers' professional development. It is the basis for critical self-reflection, a prerequisite for reducing discrepancies between one's performance and set goals, a tool for identifying blind spots, and a means of correcting false self-assessments. Student feedback opens up opportunities for teachers to improve their teaching by comparing students' perspectives on instructional quality with their own. Feedback can also help teachers to implement democratic principles and to experience self-efficacy. Finally, conditions are discussed that need to be fulfilled for student feedback to be successful.

**Keywords** Student feedback · Professional development · Democratization · Teacher satisfaction

# **1 Introduction**

Student feedback is a fundamental part of professional teaching practice. In contrast to forms of organizational assessment such as teacher evaluations, which always serve an allocation or selection purpose (e.g. promotion, access to functional positions), feedback has the aim of personal professional development. This development requires a critical reflection that compares one's own experiences with external information, and students can provide this information in a reliable and valid way.

Amid wide media interest, two attempts (in 2017 and 2019) were made in Germany and Austria to create online platforms that allowed students to rate their teachers publicly. These platforms (spickmich.de and lernsieg.de) both claimed to provide feedback for teachers in order to improve teaching. By means of categories such as "professional competence", "motivation", "popularity", "clothing", "fair examinations", or "physical appearance", teachers could be evaluated anonymously with grades. After a certain number of ratings per teacher, the results were made publicly accessible. Due to several complaints by rated teachers and by teachers' unions, both platforms were shut down.

What both platforms had in common were partly irrelevant evaluation criteria (e.g., clothing), evaluation criteria that included areas which could not be (sufficiently) assessed by students (e.g., professional competence), and—as the most critical aspect—a publication of the results.

Starting with these two negative examples, we will show how the functions of student feedback can be defined in a professional context: After a conceptual delimitation, we will point out why feedback is important for the professional development of teachers in general. After that, we will discuss three basic functions of student feedback: the development of teaching, the democratization of schools, and the improvement of teachers' satisfaction and health. In the last step, we will—in brief—propose success conditions of student feedback.

# **2 Feedback, Evaluation, Assessment, and Rating—A Conceptual Delimitation**

Because grading plays a central role in most school systems around the world and teachers usually provide feedback in the form of grades, student feedback is often equated with grading teachers (Elstad et al., 2017). The terms "student feedback", "student assessments", "student ratings", and "student evaluations" are often used more or less synonymously. It is assumed that students grade their teachers, similar to how teachers grade their students. Feedback is then considered primarily a summative form of evaluation rather than a formative way of providing information for professional development. Consequently, parallels are drawn between student feedback in school and student evaluations of teaching (SETs) at university, the latter of which are widely used for selecting and promoting academic staff. The problems with evaluations of teaching in higher education have been discussed by Sproule (2000, 2002), who argues that the adoption of the "consumer" model of education does not capture the pedagogical process in its entirety, overlooking the students' influence on this process, and that false consequences are drawn from SETs. Research in the higher education context also shows no or only minimal correlations between SETs and learning outcomes (Uttl et al., 2017; see Chap. 15 of this volume). Of course, findings like these could be used as arguments against student feedback, but the conceptual blur resulting from these different concepts requires a delimitation of what feedback means and what distinguishes it from evaluation, assessment, and ratings (Table 1), before defining what student feedback really means when we talk about its functions.


**Table 1** Conceptual delimitation (Zierer & Wisniewski, 2018)

A delimitation is of great importance for further discourse on this subject. Student feedback in schools is not synonymous with student evaluations of teaching or student ratings (concepts primarily used in higher education). Basically and primarily, it provides information for the teachers who obtain it, giving them an impression of how their students experience their teaching. However, studies show that, just as in the higher education context (Marsh & Dunkin, 1992), student feedback in schools is very often used for evaluation and assessment purposes rather than as an opportunity for personal change (Elstad et al., 2017), and that instruments are used which do not do justice to the actual purpose, for example by being inappropriate for innovative forms of teaching (Kember et al., 2002). When feedback is collected at the end of a term, students believe that their feedback to teachers does not change anything in the classroom (Chen & Hoshower, 2003; Spencer & Schmelkin, 2002). When evaluation rather than professional development is emphasized, teachers see student feedback as a controlling tool (Harvey, 2002; Newton, 2000). The formative and summative components of feedback are not categorically incompatible, but an over-emphasis on the summative components can undermine the use of feedback and negatively affect school climate (Ford et al., 2018).

In the following, we will focus on the functions of student feedback in schools that is obtained by teachers in order to acquire information on how students perceive their teaching in a formative sense; a more detailed discussion of summative functions used by school administrations to select or promote teachers is beyond our scope.

# **3 Why Student Feedback Is Important**

Teachers' sense of professional success and professional satisfaction is often attributed to largely unchangeable and unlearnable personality traits. This attribution is evident among both beginning and experienced teachers (Bromme & Haag, 2004). If one holds the view that stable personality traits are largely responsible for one's professional success, feedback is mostly irrelevant. However, empirical research shows that the concept of "the born teacher" is outdated. It is not unchangeable characteristics that primarily influence the quality of teaching but rather professional skills and knowledge, motivation, self-regulation, and attitudes (Zierer, 2015). All of these are qualities to work on, and working on them requires constant reflection based on data.

Feedback conveys an oral or written external perception based on collected data. These data can lie in the micro range, as sensory impressions or perceptions of a counterpart (for example, the perception of facial expressions and gestures), or in the macro range of an observer, in the form of multi-perspective data collection with differentiated methods and instruments such as feedback questionnaires (Buhren, 2015). Increasingly, teachers are confronted with the expectation of being reflective practitioners (Schön, 1987) who develop their professional skills throughout their professional lives (Staub, 2001). There are numerous, partly very different, definitions of professional development (Reh, 2004), but, despite differently substantiated theoretical concepts, there is broad consensus that reflexivity is a core area of professionalism (ibid.). A (self-)critical reflection that uses both one's own experience and external information forms the core of pedagogical professionalism (Paseka et al., 2011). For this reason, obtaining feedback is a core component of teachers' professional development. As active directors of instruction, teachers have a very high impact on their students' achievement (Hattie, 2009). However, not all teachers have the same influence. It is particularly high when they try to see teaching through the eyes of their students and to understand how their teaching impacts the learners (ibid.).

According to control theory (Carver & Scheier, 1982), people constantly compare their performance to a behavioral goal and, when they detect a discrepancy, attempt to reduce this discrepancy. Feedback is a necessary prerequisite of professional reflection, increasing the awareness of behaviors and the impact of these behaviors. It helps to question automatic processes, habits, and routines, providing opportunities for behavioral change. Additionally, feedback influences motivational processes by reducing negative emotions caused by an observed discrepancy between goals and performance and fostering positive emotions by decreasing such a discrepancy (Deci et al., 1999). Furthermore, performers do better on tasks for which higher quality feedback is available (Northcraft et al., 2011).

When teachers state that they do not need feedback because they know best how effective their teaching is, it must be noted that self-assessments of one's own competences are often wrong. This has been demonstrated for a wide range of tasks and requirements (Kruger & Dunning, 1999). In the worst case, the consequence is that students become bored in class and learn less than they could, while the teacher still assumes that he or she is offering the best possible instruction. Feedback serves to prevent such misjudgments by providing information that is only accessible through an external perspective (Wisniewski & Zierer, 2019).

Feedback is an essential prerequisite for goal-oriented and self-reflective processes because teachers, like any other professional group, have so-called "blind spots" in their professional practice, as described in the Johari window (Luft & Ingham, 1955; see Fig. 1), a model originally developed for corporate settings. As in any other professional context, there is relevant information about teachers that is not accessible to the person him- or herself but only to others. The relevance of blind spots can range from minor to major: from the frequent repetition of a certain filler word or unfavorable non-verbal signals to the fact that a teacher explains content too quickly or too incomprehensibly (Wisniewski & Zierer, 2019). The only way to gain access to such blind spots is feedback.

**Fig. 1** Johari window for corporate settings (Luft & Ingham, 1955)

A classic blind spot of teachers is, for example, their estimation of their own speaking time in class. Helmke and colleagues (2008) were able to show that teachers' estimates of their speaking time during a lesson differ considerably from the objectively measured time. In short: teachers talk considerably more than they think they do (Fig. 2). The example shows that there are highly relevant characteristics of teaching that are not accessible through pure self-reflection but need to be communicated from an external perspective.

In this sense, feedback offers the opportunity to reveal blind spots by comparing perspectives. Blind spots can refer to critical aspects of behavior (as in the example presented), but also to strengths and resources that a teacher does not perceive from his or her own perspective. Student feedback can provide teachers with information on both unknown strengths and unknown weaknesses.

**Fig. 2** Speaking time by teachers in the classroom—estimation and actual measurement (Helmke et al., 2008, p. 139)

# **4 Developing Teaching and Teachers with Student Feedback**

# *4.1 Development of Teaching*

What teachers actually do in their classrooms is one of the strongest predictors of students' learning outcomes (Hattie, 2009; Helmke, 2017; Seidel & Shavelson, 2007). Consequently, it is crucial to pinpoint what works in the classroom. Student feedback is supposed to help teachers improve the quality of teaching (Ditton & Arnoldt, 2004; Gärtner, 2007, 2013; Helmke, 2017) by providing diagnostic information on teaching characteristics: it can indicate whether students feel sufficiently challenged, engaged, and comfortable asking for help; tell teachers where they need to focus so that their current students benefit; point to students' misunderstandings; and diagnose teachers' specific attempts at clarification (Gates Foundation, 2012).

Theoretically, improvements in teaching quality through student feedback can be explained in three ways. Firstly, feedback helps teachers to gain information about relevant lesson characteristics that are not accessible through pure self-reflection. Procedures that question learners consciously and directly about the core components of teaching provide opportunities for developing instructional quality. Kunter and Voss (2013) distinguish surface structures of teaching (characteristics that are directly observable, e.g., social forms, forms of teaching, methods, media use) from deep structures (characteristics that become visible through the interpretation of the teaching–learning process and classroom interaction). The effectiveness of teaching depends largely on the latter (Hattie, 2009), and student feedback is a relatively reliable and valid source of information on deep structures (Wisniewski et al., 2020a; see Chap. 7 by Göllner et al. in this volume). It can reveal whether one teaching method produces better learning results than another, whether assignments were clear and produced the intended effect, whether students felt comfortable and challenged, whether content was properly consolidated, whether learning time was used efficiently, whether students were able to work without disturbance, whether the feedback students got from the teacher was helpful, and so on. Teachers get an impression of which of these aspects were seen critically, but also of which were perceived favorably by students. It is becoming apparent that positive feedback leads to a further strengthening of methods that have been used successfully in class, and effects can be seen in the tendency to make teaching more transparent and to regularly reflect on lessons with the pupils (Gärtner, 2013). Ideally, comparing the student perspective to the teacher perspective and the subsequent discussion lead to conclusions on how to optimize teaching (Desimone, 2009). This allows a shift of focus from surface structures of teaching and formal specifications (Have all curriculum goals been achieved?) to the actual learning processes that have (or have not) happened in the classroom and helps to answer the question of why this was the case.

Secondly, feedback can have an action-guiding function: people generally act differently when they expect feedback (Carver & Scheier, 1990). Feedback increases general self-awareness and, consequently, an individual's capability to inhibit behaviors that are undesired or dysfunctional (Alberts et al., 2011). When teachers know that they will get feedback based on certain criteria, they are likely to pay particular attention to these criteria (a reason why valid criteria for student feedback are crucial, see Chap. 4 of this volume). Assessing expectancies *before* feedback is given can already cause behavioral change. For example, a teacher who expects to get feedback about clarity will most probably be more aware of this aspect and put more emphasis on clarifying than without the expectancy. Similarly, a teacher who expects feedback on classroom management will monitor student behavior more carefully than without the expectancy.

Thirdly, feedback can help to implement innovations in teaching. Professional development aims to achieve change in teachers' attitudes, beliefs, and perceptions that will result in improved student achievement or other desired outcomes. Research has shown that changing teachers' attitudes, beliefs, or perceptions requires the experience of successful implementation (Guskey, 2002). Thus, student feedback is one key element of the implementation process, as it can demonstrate that an innovation works (or does not). Student feedback was found to increase the implementation of innovations (Mortenson & Witt, 1998; Noell et al., 2002) by providing information on whether innovations have a positive effect.

# *4.2 Democratization of Schools*

The development of teaching is often focused on effectiveness, aiming at an increase in student achievement. Student feedback can contribute to the development of teaching in an additional way by promoting democratic attitudes. Feedback between the participants of school life is a basic condition for participation and therefore the experience of democratic structures. While professional feedback from teachers to students is the basis for an appreciative climate and successful teacher–student relationships, student feedback is the basic form of freedom of expression with regard to successful learning conditions and the prerequisite for a dialogue about teaching and learning (Wisniewski et al., 2019).

In the sense of social-cognitive learning theory, school has an indirect or latent influence on the political socialization of pupils. A prerequisite for successful interaction is the granting of mutual recognition and appreciation, not as a sufficient but at least as a necessary condition for a democratic form of school life. "When developing a political standpoint, young people apparently pay less attention to bold confessions and teachings than to the nuances both in interpersonal relationships and in the context of educational institutions" (Kleeberg-Niepage, 2012, p. 13, translated). The participation of young people in discussions and co-determination processes in educational institutions plays an important role (see Chap. 13 of this volume).

Student feedback contains several components of a basic understanding of democracy: Students are given the opportunity to express their opinions in a differentiated way. They have to think about how different criteria for the quality of teaching are to be assessed in each individual case instead of assessing teaching in general as "great" or "bad". They realize that their own opinion is not to be seen as absolute, but that there are different perspectives on a subject. They learn to engage in a dialogue with their teachers on how changes can lead to better conditions for all those involved and thus influence an area relevant to them—nothing other than social participation in the school system. And finally, feedback offers the opportunity for mutual appreciation between teachers and learners.

# *4.3 Improving Teachers' Satisfaction and Health*

Student feedback can help to improve teaching, but an additional, and often overlooked, potential benefit is that it supports teachers in developing a professional experience that is more satisfying and, as a consequence, healthier. Although this may seem counterintuitive because feedback can (and often does) include criticism, research cautiously supports the assumption of such positive effects.

Teachers' satisfaction is a key affective reaction to working conditions and an important predictor of teacher attrition (Ford et al., 2018). It is related to their expectations of self-efficacy, in other words, the belief that they can produce desirable changes in student achievement (Ford et al., 2018; Skaalvik & Skaalvik, 2007; Wang et al., 2015), which in turn is enhanced by feedback. Teachers with low self-efficacy expectations do not believe that they can successfully provide instruction that will increase student performance (Finnegan, 2013), whereas teachers who experience that their use of feedback leads to positive changes in their practice are more satisfied than those who do not (Ford et al., 2018). When teachers are given areas to improve or reflect on, their perception of the feedback's effectiveness is higher than when only praise is given (Milanowski & Heneman, 2001).

Enns and colleagues (2002) were able to demonstrate benefits for teachers who seek regular feedback in their professional practice.


In this sense, feedback does not, as one might expect, demotivate teachers through criticism; on the contrary, it supports and encourages them. Feedback even has this motivating effect regardless of whether it is positive or negative (Pritchard et al., 2002). Further, it leads to a more realistic self-assessment (Mayo et al., 2012), promotes a solution-oriented approach to problems (Enns et al., 2002), and increases the experience of self-efficacy. Considering this, the reflection on lessons with the help of external data can be one of the most important resources for a satisfying professional practice.

Finally, job satisfaction has an effect on teachers' health. Symptoms of burnout (emotional exhaustion and depersonalization) are negatively related to teacher self-efficacy (Skaalvik & Skaalvik, 2007), and teachers with a high sense of efficacy seem to employ a pattern of strategies that minimizes negative emotions (Finnegan, 2013). It is at least plausible that an increase in the above-mentioned areas will in turn have a positive effect on the quality of teaching. Reciprocally, students give more positive feedback to teachers who, in the sense of a low psychosocial risk for stress symptoms, show a favorable combination of work commitment, resilience, and emotions, a high degree of resistance to professional problems, and a higher level of positive emotions (Klusmann et al., 2006). Consequently, student feedback can make a significant contribution not only to the job satisfaction but also to the health of teachers.

# **5 Success Conditions of Student Feedback**

We have tried to show in this chapter that student feedback has a number of important functions for the development of teaching and teachers. However, several success conditions need to be fulfilled for student feedback to really fulfill these functions. Therefore, we propose the following four criteria:

1. The aim of student feedback needs to be transparent to all participants.

Formative student feedback with the purpose of personal development must be clearly separated from any forms of summative evaluations, assessments, or ratings that are used for administrative decisions. Transparency is also needed with regard to the availability of feedback results: the obtaining teacher should be able to decide who has access to these results.

2. Student feedback needs to be informative.

Feedback is most useful when it contains a high amount of information (Hattie & Timperley, 2007; Wisniewski et al., 2020b). Consequently, student feedback should provide information that allows the teacher to gain detailed insight into strengths and weaknesses of her or his teaching, pointing at opportunities to make suitable changes and reinforcing functional behavior.

3. Student feedback needs to be based on sound criteria.

In many schools, ad hoc instruments that are mainly based on everyday assumptions and not on sound theory are used to obtain student feedback (Ory & Ryan, 2001). This brings the disadvantage that criteria are highly subjective and arbitrary. Useful student feedback is based on criteria whose importance is supported by empirical evidence and which cover deep structures of teaching (with positive effects on student learning).

4. Teachers need support when dealing with student feedback.

The most crucial step in the process of using student feedback is not obtaining information but dealing with the information. Penny and Coe (2004) have shown the importance of supporting teachers when dealing with feedback information. High impact was found when teachers had various support systems at hand, including counseling and coaching.

# **6 Conclusion**

The various functions of student feedback suggest that it should be a self-evident part of teachers' professional development, providing valuable information at little or no cost. It is therefore rather astonishing that it is still not a matter of course in schools. Student feedback helps teachers and students to get into conversation about teaching and learning. Sometimes this is the beginning of a real feedback culture.


# **References**


**Benedikt Wisniewski** is a school psychologist, former teacher, and teacher trainer. As a researcher and lecturer, he works at the University of Augsburg (Germany). His research focuses on students' and teachers' perceptions of instructional quality and the validity of student feedback. In 2014, he co-founded *FeedbackSchule*, a German online platform for obtaining professional student feedback.

**Klaus Zierer** is a Professor of School Education at the University of Augsburg (Germany). He taught for five years in primary and secondary schools. In 2009 he was a Visiting Research Fellow at the Department of Education, University of Oxford (UK). He is currently an Associate Research Fellow of the ESRC Centre on Skills, Knowledge and Organisational Performance (SKOPE). His research interests include international aspects of school education, learning and teaching, teacher education, and professionalization.


# **Chapter 9 Effects of Student Feedback on Teaching and Classes: An Overview and Meta-Analysis of Intervention Studies**

#### **Sebastian Röhl**

**Abstract** Based on a comprehensive literature review of student feedback intervention studies in schools, this chapter provides an overview of the effects found on teachers and teaching. The first part summarizes the self-reported cognitive, affective, and motivational effects of student feedback on teachers, which can subsequently lead to behavioral changes in the classroom. The second part focuses on the extent to which these behavioral changes are perceived by students. For the first time, a meta-analysis of changes in students' perceptions of teaching was carried out for the 18 existing longitudinal studies on this topic. A small but significant positive weighted mean effect size of *d* = 0.21 for students' perceived improvement of teaching quality was found, while more in-depth analyses pointed to a beneficial effect of individual support measures for teachers regarding reflection and the subsequent development of teaching. Implications for further research and the practical implementation of student feedback in schools are discussed.

**Keywords** Student feedback · Meta-analysis · Effects · Intervention studies · Teacher · Teaching development

# **1 Introduction**

Feedback can be understood as a communicative process "in which some sender […] conveys a message to a recipient. In the case of feedback, the message comprises information about the recipient" (Ilgen et al., 1979, p. 350). This information can be used by the recipient to improve task performance (Kluger & DeNisi, 1996) or to enable and develop learning processes (Hattie & Timperley, 2007). In the case of student feedback, the feedback recipients are teachers, who receive information about their teaching from their students as senders. As described in the Introduction of this volume and in Chap. 8 by Wisniewski and Zierer, the received feedback should contain useful and meaningful information for the given teacher. As a first step, the feedback could therefore have positive cognitive and possibly also affective and motivational effects on the teacher. Subsequently, this could lead to changes in teacher behavior, thus promoting the development and improvement of teaching and professionalism. This in turn could lead to a more positive perception of teaching by students.

This overview chapter follows this process and is based on a comprehensive literature review of studies dealing with student feedback as an intervention for the improvement of the teaching quality of fully trained teachers. In the first part, findings on teacher-reported effects from student feedback are summarized. The second part contains a meta-analysis of findings of longitudinal student feedback intervention studies, which almost exclusively examined changes in teaching and classes from the perspective of students in secondary schools. Remarkably, no studies could be found which were conducted in grades one to four.

This chapter complements Chap. 11 by Göbel et al., which describes the use of student feedback in the context of the first and second phases of teacher training. In Chap. 12, Schmidt and Gawrilow describe how student feedback can be used to improve the cooperation between teachers and students. Furthermore, teachers' productive use of student feedback depends on various individual and situational characteristics, and this is described by Röhl and Gärtner in Chap. 10 and in the Introduction of this volume.

# **2 Self-Reported Effects of Student Feedback on Teachers**

Whether a feedback message leads to visible changes in the recipient's behavior depends on the effects of the feedback message on the recipient, in this case the teacher. Therefore, this part offers an overview of the literature on self-reported effects of student feedback on teachers. For the teacher obtaining feedback, student feedback can have effects at different levels (see the *processes and effects of student feedback* model (PESF) in the Introduction of this book). Here, a distinction can be made between affective, cognitive, and behavioral effects, which in turn are related to motivational processes.

Regarding the *cognitive* effects of obtaining student feedback, several studies reported an increased amount of teacher reflection on actual practice, prompted by the aspects of teaching quality included in the feedback questionnaires used (Gärtner & Vogt, 2013; Göbel & Neuber, 2019; Mandouit, 2018). As a result of the feedback received, teachers reported an improved understanding of how students perceive their teaching and classes (Gage, 1963; Thorp et al., 1994; Wyss et al., 2019). Furthermore, student feedback can help teachers to identify students' misconceptions about learning (Mandouit, 2018). Subsequently, teachers identified possible areas for improvement (Barker, 2018; Gaertner, 2014). As a side effect, the first-time use of student feedback can lead to a more positive attitude towards this instrument (Brown, 2004; Campanale, 1997; Gaertner, 2014), although opposite effects such as heightened skepticism have also been observed (Dretzke et al., 2015).

On the *affective* level, many teachers experience emotions of happiness and curiosity during feedback reception and reflection, especially if the feedback is perceived as positive (Villa, 2017). Other teachers reported emotions of anger due to feedback perceived as negative, or sadness due to helplessness regarding a possible improvement of their own teaching (Brown, 2004; Gärtner & Vogt, 2013; Villa, 2017).

Both cognitive and affective effects can impact motivational processes and lead to changes on the *behavioral* level. Teachers expressed that they paid more attention to identified improvement areas during preparation and teaching, sometimes resulting in a self-perceived improvement (Balch, 2012; Gaertner, 2014; Rösch, 2017). In addition, some teachers planned to participate in relevant professional training programs (Balch, 2012). Another behavioral outcome is the discussion about feedback received and teaching with the corresponding class, which was seen by many teachers as an important further source of information about their own teaching and a common ground for changing teaching practices (Gaertner, 2014; Thorp et al., 1994). In addition, teachers mentioned changes in their behavior before obtaining student feedback. While reflecting on the feedback questionnaire, they prepared the lessons in which the instrument was to be used more carefully, in line with the questionnaire's quality criteria (Balch, 2012; Rösch, 2017).

# **3 A Meta-Analysis of Longitudinal Studies on the Teaching-Related Effects of Student Feedback Interventions**

Without a doubt, it is desirable that positive effects of student feedback are not only reported by teachers but also become evident in student perceptions and learning achievement. Based on the process model of Student Feedback on Teaching (SFT; see the Introduction to this book), this can only be achieved if several conditions are met. First of all, the students have to report back that there is a need for improvement. This must be perceived and accepted by the teacher in the feedback reports. Furthermore, it is necessary that the teacher develops a desire for change or sets goals and then pursues them. Subsequently, the teacher's behavioral change should improve students' learning processes, and students have to perceive this behavioral change, before a positive effect of student feedback on teaching and classes becomes visible.

While intervention studies on the use of students' achievement data for instructional development in schools also focus on student achievement (e.g., Keuning et al., 2019; van der Scheer & Visscher, 2018) or the improvement of teachers' instructional skills (van der Scheer et al., 2017), investigations into student feedback have overwhelmingly focused on effects on students' perception of teaching behavior. In the literature review performed here, a single study (Novak, 1972) was found which additionally analyzed several audio-recorded lessons before and after the student feedback intervention for changes in teacher behavior. The findings of this study pointed to significantly lower proportions of teacher talk and lecturing during the lessons following repeated reception of student feedback. Regarding possible effects of student feedback interventions on students' motivation, the findings of a single study (Tozoglu, 2006) indicated a small positive effect (*d* = 0.289), but only for teachers who received enhanced support for interpreting feedback and developing their teaching. No effect was found for teachers who received only student feedback mean scores without any support. In a dissertation study, Kime (2017) measured students' achievement scores in the context of teaching evaluations based on student ratings, comparing a group of teachers receiving student feedback only with another group which additionally carried out peer coaching on the feedback received. Contrary to Kime's expectations, the analysis could not prove a significant effect on achievement scores for the peer coaching condition. However, a comparison with teachers who did not receive student feedback was not possible due to the lack of an appropriate control group. With regard to the question of the extent to which primary school pupils perceive an improvement in the quality of teaching, a study by van der Scheer (2016) found no changes in pupils' ratings of teaching quality during a data-based decision-making intervention, whereas pupils' learning achievement significantly improved. Research concerning effects on students' learning achievement that compares teachers receiving student feedback with non-receivers is still absent.

While some meta-analyses of the effects of students' mid-term feedback on classes already exist in the field of university and college teaching (e.g., Cohen, 1980; L'Hommedieu et al., 1990; Penny & Coe, 2004), a meta-analysis regarding effects in schools is still pending. The meta-synthesis regarding feedback by Hattie (2009), which resulted in *d* = 0.73, and also a recent and thorough meta-analysis of the underlying primary studies with a lower effect size of *d* = 0.48 (Wisniewski et al., 2020), mainly include feedback from teachers to students, with the exception of three meta-analyses of effects of student feedback in higher education. For the context of higher education, Cohen's (1980) meta-analysis of 17 intervention studies resulted in an effect size of *d* = 0.20 on students' end-of-semester ratings of classes for providing mid-term feedback to university teachers. If the feedback is accompanied by further measures such as individual consultation, this effect increases to an average of *d* = 0.64. Penny and Coe (2004) found an average effect size of *d* = 0.69 for student feedback augmented with peer and expert consultation in their analysis of 11 intervention studies. The analysis of 28 studies by L'Hommedieu et al. (1990) resulted in an average effect size of 0.34. Uttl et al. (2017) conducted a meta-analysis of 51 studies on the relation between student evaluation of teaching ratings and student learning achievement; the results indicated no significant overall correlation. In order to close this research gap in the field of primary and secondary schools, a meta-analysis of student feedback intervention studies that surveyed changes in students' perception of teaching quality is presented here.

# *3.1 Measures and Methods*

#### **3.1.1 Literature Search**

For this overview, a comprehensive literature search using the terms "student feedback", "pupil feedback", and "self-evaluation" was conducted in the databases ERIC, PsycInfo, Scopus, Web of Science, ProQuest, and OpenDissertations. As most of the studies found focus on student feedback in higher education, the search was limited to publications which did not contain this keyword. In a second step, articles with a theoretical or practical focus were excluded. Next, only intervention studies which reported pre- and post-measures were selected. In addition, some non-catalogued studies mentioned in scientific articles on student feedback were found. More details about the studies included can be seen in Part III.

#### **3.1.2 Study Coding**

Regarding possible moderators, most of the study characteristics are explicitly reported, such as the existence of a control group, the number of feedback reports, the duration of the treatment, and the publication type. The level of support provided for the participating teachers (see below) was coded by two trained raters. The inter-rater agreement was high (ρ = 0.85, *p* < 0.001), and diverging codes were resolved by consensus in a subsequent discussion.
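As a brief illustration of such a coding check, the following Python sketch computes a Spearman rank correlation between two raters' support-level codes; the codes shown are invented placeholders, not the data of this meta-analysis:

```python
# Hypothetical rater codes for the level of support (1-3) in each study.
from scipy.stats import spearmanr

rater_a = [1, 2, 3, 2, 1, 3, 2, 2, 1, 3]
rater_b = [1, 2, 3, 3, 1, 3, 2, 2, 1, 2]

rho, p = spearmanr(rater_a, rater_b)
print(f"inter-rater agreement: rho = {rho:.2f} (p = {p:.4f})")
```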

#### **3.1.3 Effect Size Calculation and Analysis**

The dependent variable in this meta-analysis is the student-perceived change in the quality of teaching. As the included studies use different questionnaires for student feedback, single scales or constructs are not comparable across the studies. Therefore, in order to achieve comparability of the effects, it was decided to calculate the arithmetic mean of all reported effect sizes included in each study for the students' perception of teaching as an overall effect.

Effect sizes are calculated using Cohen's *d* with a group-size-adjusted standard deviation (σpooled; Morris & DeShon, 2002). Effect size variances were estimated following Lipsey and Wilson (2001, pp. 44–49). If available, *d* was estimated using the reported means and standard deviations of pre- and post-measurements at the teacher level. Otherwise, available *t*, *F*, and χ² statistics were used.
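As a sketch of the computations described here, using standard formulas from the cited methodological literature (and anticipating the assumed pre–post correlation $r$ discussed under (a) below):

$$ d = \frac{\bar{X}_{\text{post}} - \bar{X}_{\text{pre}}}{\sigma_{\text{pooled}}}, \qquad \sigma_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}, $$

and, for a pre–post design with $n$ teachers and pre–post correlation $r$, the variance of $d$ can be approximated as (Borenstein et al., 2009)

$$ V_d = \left(\frac{1}{n} + \frac{d^2}{2n}\right) 2\,(1 - r). $$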

In this meta-analysis, longitudinal studies with and without a control group design are included. This led to some problems in the estimation of comparable effect sizes and variances:

(a) Several studies without control groups did not include the standard deviations of the measurements or the correlation between the pre- and post-test scores. While comparable effect sizes can be estimated without this information using reported *t*- or *F*-values (Lipsey & Wilson, 2001), the variances of the effect sizes can only be estimated if the standard deviations or correlations are available. For this meta-analysis, several solutions were considered. The most conservative approach would be to assume no correlation between the two measurement time points, which would lead to a strong overestimation of the variances. However, many studies report quite high consistency of student ratings of teaching quality over time (e.g., Polikoff, 2015; Rowley et al., 2019). In addition, calculating the correlation between teachers' pre- and post-measures using available data from two studies (Bartel, 1970; Ditton & Arnold, 2004) results in values of *r* > 0.73. Therefore, following the suggestions of Borenstein et al. (2009), we assumed a lower limit of *r* = 0.70 for the estimation of the effect size variances.

(b) Many studies with a control group design showed a moderate decrease in the control groups' student ratings of teaching quality between the measurement time points (Buurman et al., 2018; Gage, 1963; Nelson et al., 2015; Tacke & Hofer, 1979; Tuckman & Oliver, 1968). For the studies using a control group design, this tendency is already accounted for in the estimation of effect sizes. However, assuming that this effect is also present in the treatment groups of studies without a control group, it could lead to an underestimation of the effect in those designs. Therefore, possible moderator effects of the study design are included in our analyses.

Because of the heterogeneity of the treatment and design characteristics of the included studies, random-effects models appeared more suitable for this meta-analysis than fixed-effect models (Borenstein et al., 2009). To estimate the assumed moderator effects of study and treatment characteristics, separate mean weighted effect sizes and confidence intervals were estimated for every subgroup (Borenstein et al., 2009). Continuous study characteristics such as the number of feedback reports and the intervention duration were split at the median. The overall and moderator effect sizes and confidence intervals were estimated using the package *metafor* (Viechtbauer, 2010) in *R* (R Core Team, 2019). In addition, as three studies contributed several effect sizes from different intervention groups, a sensitivity analysis was conducted with regard to bias due to possible dependencies (Hedges et al., 2010). This revealed that the resulting biases are about *d* = 0.0001 and therefore negligible.
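The following sketch illustrates the random-effects pooling logic in Python. For brevity it uses the DerSimonian–Laird estimator of the between-study variance τ², whereas the analysis reported here used *metafor* in *R* (whose default estimator is REML); the effect sizes shown are invented placeholders rather than the study data:

```python
import math

def random_effects_pool(d, v):
    """DerSimonian-Laird random-effects pooling of effect sizes d
    with within-study variances v.
    Returns (pooled_d, ci_lower, ci_upper, tau2)."""
    w = [1.0 / vi for vi in v]                       # fixed-effect weights
    d_fixed = sum(wi * di for wi, di in zip(w, d)) / sum(w)
    q = sum(wi * (di - d_fixed) ** 2 for wi, di in zip(w, d))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)          # between-study variance
    w_star = [1.0 / (vi + tau2) for vi in v]         # random-effects weights
    pooled = sum(wi * di for wi, di in zip(w_star, d)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, tau2

# Invented placeholder data, NOT the effect sizes of this meta-analysis:
d = [0.35, -0.10, 0.25, 0.15, 0.40]
v = [0.02, 0.05, 0.03, 0.04, 0.06]
pooled, ci_lo, ci_hi, tau2 = random_effects_pool(d, v)
print(f"pooled d = {pooled:.2f}, 95% CI [{ci_lo:.2f}, {ci_hi:.2f}], tau^2 = {tau2:.3f}")
```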

An analysis of possible outliers and influential studies was conducted. As indicators, we used Cook's distance (Cook & Weisberg, 1982), test statistics for residual heterogeneity when each study is removed in turn (Viechtbauer, 2010), and the distribution of the included studies' weights.
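In the same spirit as these influence diagnostics, a simple leave-one-out check (a complement to Cook's distance, not the procedure used in this chapter) recomputes the pooled effect with each study removed in turn; it reuses the `random_effects_pool` function and the placeholder data from the sketch above:

```python
def leave_one_out(d, v):
    """Report how the pooled effect shifts when each study is removed,
    as a simple influence diagnostic."""
    full, *_ = random_effects_pool(d, v)
    for i in range(len(d)):
        pooled_i, *_ = random_effects_pool(d[:i] + d[i + 1:], v[:i] + v[i + 1:])
        print(f"without study {i}: pooled d = {pooled_i:.3f} "
              f"(shift {pooled_i - full:+.3f})")

leave_one_out(d, v)
```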

# *3.2 Characteristics of Included Studies*

In the literature review, 18 longitudinal studies with student feedback treatments published between 1960 and 2019 were identified (see Table 1). The designs of these studies are experimental or quasi-experimental. All studies include at least one pre- and one post-measurement of students' perception of teaching quality, but not all of them provide a control group comparison. Seven of the studies were conducted in the USA, three each took place in Australia and Germany, two in the Netherlands, and one each in Great Britain, Turkey, and Austria.

All studies utilized questionnaires which were mainly based on closed questions or rating scales. The research teams carried out the scoring and provided a feedback report to the teachers. One study used a digital smartphone-based feedback system for this purpose (Bijlsma et al., 2019). All included studies were conducted in grades 5–13. While five interventions were limited to exactly one grade level, the other studies involved teachers from different levels. Three interventions also restricted the school subject for better comparability of the classes: Novak (1972) focused on biology teachers, Rösch (2017) on physics, and Bijlsma et al. (2019) on mathematics.

The findings on the effects of a student feedback intervention on student-perceived changes in teaching behavior are heterogeneous across the studies. While two studies show clearly negative treatment effects (Bennett, 1978, *d* = −0.30; Knox, 1973, *d* = −0.24),<sup>1</sup> most studies report effects ranging from *d* = 0.1 to *d* = 0.5.

Furthermore, some studies instructed teachers to focus on only one to three areas for improvement in the subsequent classroom development (Fraser & Fisher, 1986; Fraser et al., 1982; Nelson et al., 2015; Thorp et al., 1994). However, information on which aspects were selected by teachers for improvement is only available for the three case studies. As expected, their results show the highest improvements in the targeted areas (up to *d* = 0.8), whereas the other scales do not change. Another study (Mayr, 1993, 2008) examined only individual areas of teaching which had been agreed upon with the teachers. However, as such information is completely lacking for all other studies, the individual prioritization of certain areas by individual teachers cannot be considered in this meta-analysis, and we therefore used the average effect sizes across all scales in each study. This also means that the average overall effects across all included scales are smaller than the larger improvements reported for some selected scales.

The *sample size* differs greatly between the studies. Whereas some reported case studies with single teachers (Fraser & Fisher, 1986; Fraser et al., 1982; Thorp et al., 1994) or one team of five teachers (Mandouit, 2018), the other studies used sample sizes ranging from *N* = 10 to *N* = 508 teachers. The duration of the intervention also varied between the studies, from one month to one year, with an average of *M* = 3.06 months. During these periods, a varying *number of feedback reports* was provided to the teachers. In most of the studies, the last feedback report was used as a post-measure of changes in the student-perceived teaching quality or teacher behavior.

<sup>1</sup> Noteworthy, both studies were conducted by persons from the school administration. To what extent the negative effect can be explained by possible refusal attitudes of subordinate teachers cannot be clarified here due to the small number of studies.


aExcluded after outlier analysis as post-measure of changes in the student perceived teaching quality or teacher behavior. Therefore, for comparability reasons, we counted the number of student feedback reports before the last measurement. Whereas 11 studies reported only one student feedback measurement to the teachers, the other studies obtained and reported feedback up to five times. A special case in point is the study by Bijlsma et al. (2019), where teachers could use the smartphone app to obtain feedback as often as they wanted. The frequency varied between 4 and 17 feedback measurements, with an average of 6.7 for these teachers.

The studies reported here also differ in the manner and amount of *support* provided for the feedback interpretation and subsequent developmental processes. In line with the meta-analysis results from higher education described above (Cohen, 1980; Penny & Coe, 2004), findings on teachers' use of students' achievement data indicate that solely providing data rarely leads to subsequent changes in teaching (Schildkamp et al., 2015). Thus, it seems important to consider this characteristic of the interventions. Furthermore, three of the included studies analyzed different treatment conditions (Bartel, 1970; Bell & Aldridge, 2014; Tozoglu, 2006): some of the teachers received written feedback without further instructions, while the others additionally received reflection prompts and counseling. All three studies showed significantly more positive effects for the latter condition. For this reason, the effects of these different treatments are reported as two separate effect sizes for each of these studies in the meta-analysis. During the raters' coding of the support, it became apparent that the following three levels of support can be distinguished:


A further distinguishing feature of the studies is the *type of publication*. While the findings of some studies were published in peer-reviewed journals, others were only available as reports or university theses and required considerable effort to locate. If only studies from scientific journals are included in meta-analyses, this easily leads to a so-called "publication bias", since journal articles usually report higher effects and more significant findings than works not published in such journals (Lipsey & Wilson, 2001). An analysis of effect differences between publication types can indicate whether a publication bias also exists in this research field (Borenstein et al., 2009). Of course, this leaves unanswered the question of how many further studies exist that could not be found.

# *3.3 Results of the Meta-Analysis*

A first estimation of the mean weighted effect size in a random-effects model, using all 21 effect sizes found, resulted in *d* = 0.23 (*p* < 0.001, 95%-C.I.: 0.13–0.33). Analyses of influential studies showed that the reflection group in the study by Bell and Aldridge (2014) was overweighted because of its exceptionally large sample. In addition, analysis of the residual heterogeneity led to the exclusion of the enhanced feedback group from Tozoglu (2006) due to outlier characteristics of this subsample.

For the remaining 19 effect sizes, the estimation of the overall mean weighted effect size led to *d* = 0.21 (*p* < 0.001) with a 95% confidence interval of 0.11 < *d* < 0.32. The effect sizes with confidence intervals of all included studies are plotted in Fig. 1.
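A forest plot of this kind can be produced directly from the fitted model object; in the sketch below (continuing the simulated example), the study labels are hypothetical:

```r
# Forest plot of study effect sizes with 95% confidence intervals (cf. Fig. 1)
forest(res, slab = paste("Study", dat$study_id), xlab = "Effect size d")
```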

The heterogeneity test statistics (*Q*(18) = 16.62, *p* = 0.549) indicate that the effect sizes can be regarded as statistically homogeneous (Lipsey & Wilson, 2001).

#### **3.3.1 Moderator Analysis**

The resulting mean effect sizes and 95% confidence intervals of the subgroups split along the moderator variables are presented in Table 2. In line with the relatively small number of studies found, the confidence intervals of the different subgroups mostly overlap.

The only study characteristic which turned out to be a significant moderator is the level of support. Treatments with a high level of individual support for reflecting on feedback and teaching development (level 3) showed a significantly higher effect size (*d* = 0.52, *p* = 0.010) than studies with a medium or low level of support. Contrary to the assumptions, no significant differences were found between the effect sizes of studies including control groups and studies without (*d* = 0.21 vs. *d* = 0.24). The differences between studies with only one feedback report and those with several (*d* = 0.25 vs. *d* = 0.01), although presumably considerable, were not statistically significant (*p* = 0.123). The same applies to the differences regarding the treatment duration of the intervention and to whether the studies were published in scientific journals or only accessible as theses or reports.

**Table 2** Analysis of moderator effects regarding study and treatment characteristics

<sup>a</sup>Significance of moderator; <sup>b</sup>Dummy-coded
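In *metafor*, such subgroup comparisons correspond to a meta-regression with a dummy-coded categorical moderator. The sketch below (again on the simulated data, with the hypothetical variable `support_level`) shows the form of such a test; the QM statistic indicates whether the subgroups differ significantly:

```r
# Moderator test via meta-regression with a dummy-coded moderator
dat$support_level <- sample(1:3, 18, replace = TRUE)  # simulated support levels
res_mod <- rma(yi, vi, mods = ~ factor(support_level), data = dat)
summary(res_mod)  # QM test for the moderator; contrasts vs. the reference level
```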

# **4 Conclusion and Discussion**

In this overview chapter, findings of a comprehensive literature review on the effects of student feedback interventions in schools were presented. In a first step, effects on teachers were summarized from the literature found. Regarding cognitive effects, studies reported reflective thinking processes on teachers' own perceptions and goals of teaching—initiated by feedback reports and also by questionnaire topics—which could lead to an identification of areas for improvement. In addition, a fostering effect on teachers' understanding of students' perception of teaching and learning processes was observed. Both positive (happiness, joy) and negative (sadness, feelings of helplessness) affective reactions were found with regard to the feedback received. Cognitive and affective processes can result in motivational effects, which can change teachers' behavior in class. According to teachers' self-reports, these behavioral changes manifest themselves in a more intensive preparation of lessons and a stronger monitoring and control of their own actions in class, if they consider the feedback critical. Furthermore, teachers initiated discussions with students about the feedback received and about improving teaching and collaboration within the class.

In a second step, this chapter examined whether and to what extent behavioral changes by teachers were perceived by the students. To answer this question, the first meta-analysis of the effects of student feedback interventions on student-perceived teaching quality in schools was conducted, including 18 studies with 19 effect sizes. Using a random-effects model, a weighted mean effect size of *d* = 0.21 was found. Although this effect seems relatively small, it is significant and lies in a range similar to that of meta-analyses on student feedback use in higher education (Cohen, 1980; L'Hommedieu et al., 1990). Furthermore, it should be noted that these analyses were based on all the teaching characteristics assessed by the students, whereas teachers often focused only on specific areas for improvement. For the targeted areas, the case studies in particular showed considerably greater effects. In addition, the effect sizes varied to a considerable extent between the different scales of teaching dimensions used in the larger studies.

Moderator analysis showed an increase in the effect size to *d* = 0.52 when additional individual support was provided, which is in line with findings for college and university teachers (Penny & Coe, 2004). Other moderator analyses showed no significant effects. This emphasizes the important impact of providing appropriate teacher support for the feedback-related teaching development process, whereas other structural treatment characteristics play no or only a minor role. However, there were indications that further studies should pay particular attention to the number of feedback reports provided in longer-term studies.

Considering the findings of the first part of this chapter on the teacher-reported effects of feedback, the teacher's perception processes and reactions are the "eye of the needle" for improving teaching. Therefore, support for teachers using student feedback should aim at facilitating a constructive cognitive processing of the feedback and the accompanying affective reactions, so that teachers can develop action alternatives and the motivation for change is fostered.

As a limiting factor of the meta-analysis presented, it should be noted that only relatively few studies were found, which reduces the power of the moderator analyses. However, the similarity of the findings presented here to meta-analyses from higher education points toward the validity of these results, together with the fact that there is no indication of a publication bias or design effect in the included studies. This chapter thus provides evidence for the effectiveness of student feedback as a tool for improving the quality of teaching as perceived by students. It offers a comprehensive overview of the effects on teachers, which individual studies had so far considered only in isolation. Furthermore, an extensive literature review and meta-analysis of intervention studies on student feedback in schools was presented for the first time.

Simultaneously, there are various implications for further research on the effects of student feedback in schools:


For the practical use of student feedback for teaching development in schools, this meta-analysis also yields several implications. Most importantly, the findings emphasize the need to support teachers in using student feedback. This concerns not only the subsequent lesson development but also the interpretation of feedback reports, dealing with accompanying emotions, identifying areas for improvement, and working on them. Such support can take place, for example, through coaching and supervision, but also in collegial settings such as professional learning communities.

Additionally, when planning the implementation of student feedback in schools, there is a need to consider organizational characteristics which are beneficial for constructively dealing with feedback, as presented in Chap. 10 by Röhl and Gärtner in this volume.

**Acknowledgements** The author wants to thank Prof. Dr. Martin Schwichow and Prof. Dr. Wolfram Rollett from the University of Education Freiburg for their valuable suggestions and support for this study.

# **References**

*Studies included in the meta-analysis are marked with an asterisk\**


**Sebastian Röhl** has been an Academic Assistant in the Department of Educational Science at the University of Education Freiburg and is currently a Postdoctoral Researcher in the Institute of Education at Tübingen University (Germany). Before that he worked for more than 10 years as a grammar school teacher, school development consultant, and in teacher training. Among other areas, he conducts research in the fields of teaching development and teacher professionalization through feedback, social networks in inclusive school classes, as well as teachers' religiosity and its impact on professionalism. In addition, he is the director of an in-service professional master's study program for teaching and school development.


# **Chapter 10 Relevant Conditions for Teachers' Use of Student Feedback**

**Sebastian Röhl and Holger Gärtner**

**Abstract** Based on findings from research on organizational feedback and data use in schools, this chapter systematizes relevant factors influencing the use of student feedback by teachers in three domains: (1) personal characteristics of the feedback recipients (teachers), (2) characteristics of the organization (school), and (3) characteristics of the feedback information (data). We identified teachers' self-efficacy, attribution styles, goal orientations, and age or professional experience as relevant individual characteristics. In addition, teachers' attitude toward students' trustworthiness or competence as feedback providers appeared to be relevant for the use of student feedback. Beyond that, findings on organizational characteristics for teachers' successful handling of feedback pointed to the importance of a feedback culture and organizational safety, leadership, supportive measures, and the perceived function of feedback as control vs. development. Furthermore, relevant characteristics of feedback information were identified as comprehensibility, valence, and specificity. Although such findings from other fields of research have been known for some time, studies on student feedback concerning these aspects are rare. Finally, practical measures are derived for each of the three domains in order to increase the use of student feedback by teachers.

**Keywords** Student feedback · Teacher characteristics · Organizational characteristics · Feedback information · Data use

S. Röhl (B)
University of Education, Freiburg, Germany
e-mail: sebastian.roehl@ph-freiburg.de

H. Gärtner
Freie Universität Berlin, Berlin, Germany
e-mail: Holger.Gaertner@isq-bb.de

# **1 Introduction**

Since the publication of Hattie's meta-analysis *Visible Learning* (2009), the use of student feedback as an effective method for improving the quality of teaching has moved strongly into the focus of educational practice (see Chap. 8 by Wisniewski and Zierer in this volume). In addition, student feedback is discussed in models of data-based decision-making and is seen as an informative addition to the analysis of student performance data, providing teachers with well-founded information about areas of their teaching which they can improve (Lai & Schildkamp, 2013). The use of student feedback consists of two phases: first, teachers ask their students for feedback on teaching and on their perception of the learning environment; second, this information can be used to develop teaching and professional competencies. This chapter focuses on the questions of (a) which factors influence whether teachers collect feedback on their teaching and (b) whether they use this feedback to improve their teaching. To answer these questions, we examine two different research approaches. In one approach, the results of student surveys are understood as feedback on teaching, so that theories and findings from the field of organizational feedback research can provide important information (e.g., London & Smither, 2002; Smither et al., 2005a). In the other approach, the results of student surveys are seen as relevant data about teaching quality. In this case, theories and findings from the field of data-based decision-making in schools (e.g., Schildkamp, 2019; Schildkamp et al., 2015) are helpful in providing relevant information on this topic.

Research on feedback in organizational psychology focuses more on the cognitive, emotional, and motivational processes of the feedback receiver, which can subsequently lead to behavioral changes such as improved performance and increased commitment. From this point of view, it is necessary that the receiver accepts the feedback, is willing to respond, develops and commits to alternative actions, and finally implements them (Ilgen et al., 1979; Smither et al., 2005a; Kahmann & Mulder, 2011; Introduction to this volume by Röhl et al.). For this process, the characteristics of (1) the feedback recipient, (2) the sender and his or her relation to the recipient, (3) the feedback message, and (4) the organization have proven to be important factors.

Models of data-based decision-making describe the process of using data to support school quality development (e.g., Brunner & Light, 2008; Helmke & Hosenfeld, 2005). Lai and Schildkamp (2013) define the process as consisting of five steps: (1) First, it is necessary to clarify within the school which question the data should answer, i.e., the purpose of the data collection. (2) Subsequently, the data considered relevant are collected (e.g., performance data, classroom observations, survey data, school administration data). (3) The collected data are analyzed with regard to the initial question. (4) This is followed by an interpretation of what the data mean for the initial question and which consequences they should have for school and teaching development. (5) The last step is the implementation of the planned measures in everyday school life. Based on well-known models of data-based school and teaching development, relevant factors influencing the process of data use can be distinguished at different levels (Coburn & Turner, 2012; Visscher & Coe, 2003). Schildkamp et al. (2017), for example, differentiate between three relevant influences on the process of data use: (1) characteristics of the school organization, such as the existence of support structures or the significance of data use for school management, (2) characteristics of the existing data, such as user-friendliness or timely provision, and (3) characteristics of the data users, such as how qualified teachers are in analyzing data or what attitudes they hold towards the use of data.

In addition, however, a differentiation should be made as to whether the collection of student feedback is voluntary or not, and for what purpose the feedback is to be used. The following situations can be distinguished: (a) teachers voluntarily seek feedback on their own initiative, (b) student feedback is delivered to teachers as an established practice or arranged by the organization, but without an official accountability purpose, and (c) student feedback is collected for accountability purposes. Most of the literature included in this overview refers to situations (a) and (b).

# **2 What Are Relevant Conditions for Teachers' Use of Student Feedback?**

This chapter summarizes the empirical findings from the literature on relevant influencing factors according to the three areas mentioned in both research fields: teachers as feedback recipients and data users, the school organization, and the feedback message or data. As the feedback senders in this context are uniformly students, teachers' perception of students as competent feedback providers is particularly relevant.

# *2.1 Teachers as Feedback Recipients and Data Users: Relevant Individual Characteristics*

Studies on feedback use from organizational research usually focus on one or more of the following aspects: feedback-seeking behavior, acceptance, perceived usefulness, and performance improvement due to feedback.

*Older* employees or those with longer professional experience perceive feedback as less useful (Ilgen et al., 1979). This tendency is also evident for teachers: older teachers seek less feedback from colleagues or peers (Kunst et al., 2018; Runhaar et al., 2010). Regarding student feedback, teachers with longer professional experience are more skeptical about its usefulness (Dretzke et al., 2015), and older teachers use student feedback less often (Ditton & Arnold, 2004b). Some findings on *gender* effects show that female teachers seek more feedback from colleagues (Runhaar et al., 2010) and tend to improve their teaching more after a student feedback intervention (Buurman et al., 2018).

Many findings from social cognitive psychology have shown *self-efficacy* to be a particularly important personal factor in the use of feedback (e.g., Bandura, 1997; Heslin & Latham, 2004; Lyden et al., 2002; Sedikides & Strube, 1995; Stajkovic & Sommer, 2000). This seems plausible, as receivers of critical or negative feedback are more likely to respond with additional effort if they are convinced that they can achieve an improvement. In addition, feedback is seen as less threatening to self-esteem if persons are convinced that they can respond productively to criticism. Accordingly, teachers with higher self-efficacy seek more feedback and are more willing to reflect upon it (Runhaar et al., 2010), and higher self-efficacy correlates with a positive attitude toward school evaluation results (Schneewind, 2007). Ditton and Arnold (2004b) found a differential effect of teachers' self-efficacy on the improvement of teaching after a student feedback intervention.

Closely related to self-efficacy is the concept of *attribution styles* (Weiner, 1985). People differ in whether they attribute the causes of a performance result or feedback more to themselves (internally) or to other people and circumstances (externally). Additionally, the causes can be seen as stable or variable, and persons further differentiate whether these causes are controllable or not. Since attribution styles are particularly associated with motivational and emotional effects as well as with convictions regarding individual freedom of action, they are considered relevant for dealing with feedback (Strijbos & Müller, 2014). For example, a change in effort or action based on negative or corrective feedback can only take place if the recipient assesses the cause of the feedback as changeable by him- or herself—this corresponds to an attribution as internal, variable, and controllable. With an internal attribution, the receiver can assume responsibility for his or her own results. If, in addition, the cause of a negative result is regarded as controllable and changeable, adjustment processes can be initiated to achieve the performance target. If the cause of negative feedback is considered internal but stable and not controllable, e.g., attributed to one's own lack of ability, this will not lead to a motivation for change. Furthermore, an attribution to one's own personality can lead to negative, performance-reducing affect (Kluger & DeNisi, 1996) and to a weakening of the self-concept (Ilgen & Davis, 2000). In the same way, positive feedback can only increase self-efficacy if the cause is assessed as controllable and internal (Bandura, 1997; Lyden et al., 2002; Tolli & Schmidt, 2008). In order to maintain a positive self-concept, recipients of negative feedback tend toward an external attribution (Korn et al., 2016; Sedikides & Strube, 1995). Studies in the school context identified teachers' attribution styles as crucial for the sensemaking process in the use of student achievement and self-evaluation data (Bertrand & Marsh, 2015; Schildkamp & Visscher, 2009). For example, teachers who attributed students' achievement to their own instruction (internal, variable, and controllable) improved their teaching successfully, while causal attribution to student or test characteristics inhibited instructional improvement (Bertrand & Marsh, 2015). Although these effects are well elaborated in psychological research, as far as we know only Tacke and Hofer (1979) have analyzed the effects of teachers' (*n* = 20) internal and external attributions of received positive and negative student feedback. Their results did not show any associations with teachers' improvement of teaching.

As a further relevant personal factor, various studies show significant effects of persons' *goal orientations* on the processing and use of feedback (Elliott & Dweck, 1988). Most research approaches distinguish between mastery goal orientation, which focuses on the development of competence or task mastery; performance-approach goal orientation, which focuses on demonstrating competence relative to others; and performance-avoidance goal orientation, which concentrates on avoiding the demonstration of incompetence (Elliot, 1999). Performance goal orientations are often linked with the belief that abilities (e.g., intelligence) or competencies are internal, stable, and not controllable. By contrast, people with a high mastery goal orientation tend to assume that their own abilities can be changed or improved, thus tending toward internal, variable, and controllable attribution. Furthermore, studies from organizational psychology point to significant correlations between persons' self-efficacy and mastery goal orientation (Runhaar et al., 2010; VandeWalle, 2001). For the processing of and constructive reaction to negative feedback, a high mastery goal orientation proves favorable (Elliott & Dweck, 1988; He et al., 2016; Heslin & Latham, 2004), whereas a high performance-avoidance goal orientation often leads to lower performance in this situation (VandeWalle et al., 2001). When feedback is positive, performance-approach goal-oriented persons tend to react with increased effort, whereas mastery goal-oriented individuals maintain their performance or show a weaker increase in effort (Cianci et al., 2010; Donovan & Hafsteinsson, 2006).

Studies on teachers' handling of feedback from colleagues or peers also confirm these effects (Funk, 2016; Kunst et al., 2018), with teachers high in mastery and low in performance-avoidance goal orientation showing the highest level of feedback-seeking behavior (Runhaar et al., 2010). We therefore assume that motivational goal orientations show similar effects for teachers' use of student feedback, although the only study we found in this regard (Tacke & Hofer, 1979) did not demonstrate any such effects, albeit using an older conceptualization of general achievement motivation.

Even though goal orientations are largely stable personality traits (Praetorius et al., 2014), they can partly be influenced by framing (prompting) effects. If an evaluation or feedback is presented as a learning opportunity, this tends to lead to a higher mastery orientation, whereas its description as a control instrument is associated with a stronger performance-approach or performance-avoidance goal orientation, depending on whether the person considers him- or herself competent and capable (Cianci et al., 2010). This could also explain the strong relation of teachers' perceived control purpose of student feedback with their resistance against this instrument and thus a lower acknowledgment of the feedback (Elstad et al., 2015). Conversely, a perceived developmental purpose of student feedback is linked to a higher appreciation of its usefulness (Elstad et al., 2017).

In the context of data-based decision-making, teachers' *data literacy* is mentioned as an important factor for the use of student achievement data (Schildkamp et al., 2017). This means that teachers must be able to understand the results (data), which are often numerical, draw the right conclusions, and translate them into actionable steps for their subsequent teaching (Mandinach & Gummer, 2016). Obviously, teachers also need a kind of data literacy when using student feedback. However, as far as we know, no study results are available in this regard.

Studies investigating the relationships between the Big Five *personality traits* and the use of feedback show inconsistent and sometimes contradictory findings (Strijbos & Müller, 2014). Some analyses suggest a positive effect of higher agreeableness and conscientiousness on feedback use and acceptance (e.g., Bell & Arthur, 2008; Guo et al., 2017). Other findings show a negative effect of extraversion and neuroticism (Smither et al., 2005b), or fail to find any significant correlation at all (Walker et al., 2010).

Findings on the effects of teachers' *stress* experience indicate a negative association with student feedback use (Ditton & Arnold, 2004b). At the same time, another analysis indicates that self-evaluations can lead to additional stress among teachers, which in turn leads to a stronger rejection of the procedure and a lower acceptance of student feedback (Elstad et al., 2015).

As many studies from organizational psychology point out, the perceived trustworthiness and competence or expertise of the feedback provider has an important effect on the acceptance and use of feedback (Cherasaro et al., 2016; Ilgen et al., 1979; Lechermeier & Fassnacht, 2018; Steelman et al., 2004; Raemdonck & Strijbos, 2013). In the context of student feedback to teachers, this means that teachers' attitudes toward students' judgment accuracy and trustworthiness have an important impact. Findings of several studies confirm this point (Balch, 2012; Ditton & Arnold, 2004b; Elstad et al., 2017), whereas the analyses of Gaertner (2014) show a positive correlation between skepticism regarding student responses and reported use of feedback. Skepticism about feedback from young students in particular is reflected in studies focusing on the quality of feedback from primary school students (De Jong & Westerhof, 2001). This skepticism may explain the low use of student surveys in primary schools. Gärtner (2010), for example, shows, on the basis of the usage statistics of an online portal for student surveys in the German federal states of Berlin and Brandenburg, that only a few feedback surveys take place in primary schools (8.3% in grade levels 3 and 4, and 18.7% in grade levels 5 and 6), although about half of all students are taught in primary schools. On the other hand, there is also evidence of the validity of primary school students' perceptions of teaching (Fauth et al., 2014; Gärtner & Brunner, 2018; van der Scheer et al., 2019). These results prompt the questions: (a) under which circumstances do teachers trust younger students to be able to assess teaching competently (Igler et al., 2019), and (b) which student characteristics influence whether teachers assess their students as competent feedback providers (age, achievement level, socio-economic status, language skills, etc.)? In addition, studies are still lacking on the adjacent question of the extent to which teachers' student orientation is linked with the acceptance and perceived usefulness of student feedback.

# *2.2 School: Relevant Organizational Characteristics*

Several studies summarize different organizational characteristics, such as support for giving and interpreting feedback, a non-threatening atmosphere, or the value placed on feedback for improvement, as the *feedback culture* of an organization. The results provide evidence of the importance of this overall concept as a moderator for the use of feedback in organizations (Kahmann, 2009; London & Smither, 2002; Mulder, 2013). The feedback culture within a teaching staff has also been found relevant for the systematic use of student feedback (Gaertner, 2014). In addition, compliance-oriented cultures in schools appeared to hinder the developmental use of students' achievement data (Farrell & Marsh, 2016).

However, studies have also shown that specific organizational characteristics have effects on the productive use of feedback. The perceived *team psychological safety*, which means that team members share a belief that the team is safe for interpersonal risk taking, has proved to be relevant for feedback use (Edmondson, 1999; Harvey et al., 2019; Semmer & Jacobshagen, 2010).

In addition, a beneficial approach to increase the productive use of feedback in organizations is to offer special support in understanding feedback, setting goals, and implementing them in practice; this includes coaching, group reflections, and counseling (Luthans & Peterson, 2003; Smither et al., 2003; Walker et al., 2010). In the context of data use at schools, training and support for teachers with regard to data analysis and interpretation has been found instrumental for instructional data use (Farrell & Marsh, 2016; Kerr et al., 2006; Schildkamp & Visscher, 2009). In the context of student feedback, in particular, those intervention studies which provided supportive measures for reflection and teaching development show significantly higher positive effects (see Chap. 9 by Röhl in this volume).

In all of this, *leadership* plays an important role in feedback usage processes. In organizational research, transformational leadership (Bass, 1985) has proved advantageous for team learning, feedback processes, and reflection in working groups and school teams (Lam, 2002; Runhaar et al., 2010; Tuytens et al., 2019). According to this concept, school principals should provide a clear vision for the future, inspire teachers, give the work a greater sense of meaning, and stimulate the questioning of old assumptions. Findings from research on data-based decision-making processes in schools point to the importance of encouragement from principals (Schildkamp & Visscher, 2009) and of teachers' feeling of autonomy to make decisions about their instruction in data use processes (Kerr et al., 2006; Prenger & Schildkamp, 2018). Whether teachers interpret the obtaining of student feedback more as control or as a development opportunity depends on the communication from the school leaders (Elstad et al., 2017; Lejonberg et al., 2017). School leaders' active encouragement of teachers to seek student feedback is also supportive, as extrinsically motivated feedback use is just as beneficial to reported improvements in teaching as intrinsically motivated feedback use (Gaertner, 2014).

# *2.3 Feedback Message as Data: Relevant Feedback Characteristics*

As several studies reveal, the characteristics of the feedback message also have relevant effects on the processing and use of feedback (Coe, 1998; Ilgen et al., 1979; Kluger & DeNisi, 1996). Thus, the comprehensibility of feedback results proves to be an important predictor of feedback use, both in school performance studies (Groß Ophoff, 2013) and in the context of student feedback (Ditton & Arnold, 2004a; Rösch, 2017). The findings of a study by Merk et al. (2019) on online-based student feedback indicate that teachers feel more confident with the presentation of scale averages than with the display of single values or box plots. However, since information on the variance of student perceptions as well as the individual item scores contains relevant information about one's own teaching (see Chap. 6 by Schweig and Martínez and Chap. 3 by Röhl and Rollett in this volume), promoting teachers' data literacy appears to be an important prerequisite for productive use (Schildkamp, 2019).

Findings from organizational psychology show that the perceived *valence* or positivity of feedback with regard to one's own actions is accompanied by a more precise reception, easier remembering of the feedback contents, and better acceptance of the feedback (Ilgen et al., 1979; Lyden et al., 2002). If feedback is perceived as negative, there is a tendency to adopt a defensive attitude, which serves to protect one's own self-concept and thus reduces the intensity of perception, the acceptance, and the willingness to change (Lechermeier & Fassnacht, 2018; Sedikides & Strube, 1995). In most cases, this defensive attitude is also expressed in an external attribution of the reasons for the negative feedback (see above). These reactions to feedback valence are also found in the context of student feedback (Ditton & Arnold, 2004a; Rösch, 2017). However, whether the student feedback is more positive or more negative than the teacher's self-perception has not been found to be significant for student feedback use (Buurman et al., 2018; Gaertner, 2014).

In organizational research, the *specificity* of feedback, i.e., how precisely and extensively the feedback refers to the task and its improvement, shows differential effects depending on the expertise of the recipient. Highly specific feedback leads to positive effects if the recipient is in an early practice phase with regard to the task; however, long-term learning performance is negatively influenced by such highly specific feedback (Lechermeier & Fassnacht, 2018). Low-specificity or summarized feedback, which refers to several tasks or a longer period, has the opposite effect: short-term practice performance is worse, whereas long-term learning performance is better. A tentative explanation is that low-specificity feedback could lead to an active search for possible improvements and a deeper processing of the necessary information, which in turn leads to deeper learning effects (Schmidt & Bjork, 1992). Since most teachers can be classified as professionals with many years of teaching experience, less specific student feedback could generate greater usage effects for this group. More specific forms of feedback, which include concrete suggestions for improvement, could be beneficial for novice teachers or in teacher training. In addition, in a discussion following the feedback, students can give important concrete advice on how to improve teaching (see also Chap. 12 by Schmidt and Gawrilow in this volume).

Studies on the effects of the *timing* of feedback mainly refer to the accuracy of experimental learning tasks and distinguish immediate feedback from feedback given between 10 minutes and 24 hours after completion of the learning task (Lechermeier & Fassnacht, 2018). The delay-retention effect shown here, according to which delayed feedback leads to higher long-term learning success (Kulhavy & Anderson, 1972), seems to be based on the recapitulation of the associated learning content, which leads to more in-depth memorization (Smith & Kimball, 2010). Since feedback from students to the teacher is usually given only after a certain time interval from the teaching activities, e.g., at the end of a lesson, week, or learning unit, these findings are of limited significance for this context. However, Coe (1998) argues for the school context that feedback to teachers on students' learning achievement or teaching should be given as soon as possible in order to have the maximum effect on the further development of teaching.

Furthermore, a survey instrument which is valid and reliable should be selected for a successful use of student feedback (see also Chap. 4 by Bijlsma in this volume). However, it should also suit the age of the students and the type of teaching in which it is to be used. For example, a questionnaire designed for use in the context of self-regulated learning may provide little helpful information if it is used in the context of strongly teacher-directed learning.

# **3 Conclusion and Outlook on Future Practice and Research**

This chapter summarizes existing evidence on the use of student feedback according to three relevant influencing factors: characteristics of the feedback recipient, characteristics of the organization, and characteristics of the feedback information.

The following personal characteristics of teachers which influence the use of feedback were identified: *self-efficacy*, *attribution styles*, *goal orientations*, *perceived trustworthiness and competence of students as feedback providers*, *data literacy*, and *age and professional experience*. The reported findings indicated relevant characteristics of schools as organizations: *feedback culture*, *leadership*, *safety*, *support measures*, and *perceived function of feedback as control vs. development*. Relevant characteristics of feedback information were identified as *timeliness*, *comprehensibility*, *valence*, and *specificity*.

For many of the reported teacher and feedback characteristics, evidence exists only from organizational research. Although some findings on teachers' use of feedback from colleagues or school leaders point to the transferability of these results to the school context, there are few or no studies confirming them for the context of student feedback. Regarding the characteristics of schools as organizations, somewhat more is known from data use studies, but here too only a few findings concern student feedback. With regard to future research, we believe that there is a particular need for complex intervention studies which examine the effectiveness of student feedback in teaching development while controlling for the factors identified as relevant.

The findings presented in this chapter reveal a number of indications for the beneficial use of student feedback in schools. Firstly, with regard to organizational conditions, it seems helpful to communicate student feedback as a learning opportunity for teachers and not as a control instrument. The school management should ensure a safe environment for the use of student feedback, especially in collegial settings, and thus build up a feedback culture in the long term. Transformational and feedback-encouraging leadership can help to enable reflective and developmental processes in schools and so foster productive feedback use. Finally, in the best case, student feedback is implemented with support measures in place for the joint development of teaching.

Positive experiences in dealing with student feedback can thus possibly also change teachers' attitudes toward student feedback, such as the perceived trustworthiness of students as feedback providers (Gärtner & Vogt, 2013). In addition, this could also have a positive influence on relevant personality traits such as self-efficacy, attribution styles, and goal orientations.

With regard to the preparation of reports from student feedback, it seems helpful to make them as comprehensible as possible, especially with regard to statistical parameters (Merk et al., 2019), but also to include information about heterogeneous views among students on individual aspects of teaching. Furthermore, it appears beneficial for the developmental use of feedback to enrich reports with concrete suggestions for improving teaching activities (specificity), especially for less experienced teachers. Moreover, positive results should be particularly emphasized, so that negative results can be better accepted.

# **References**


**Sebastian Röhl** has been an Academic Assistant in the Department of Educational Science at the University of Education Freiburg and is currently a Postdoctoral Researcher in the Institute of Education at Tübingen University (Germany). Before that he worked for more than 10 years as a grammar school teacher, school development consultant, and in teacher training. Among other areas, he conducts research in the fields of teaching development and teacher professionalization through feedback, social networks in inclusive school classes, as well as teachers' religiosity and its impact on professionalism. In addition, he is the Director of an in-service professional master's study program for teaching and school development.

**Holger Gärtner** is Scientific Director of the Institute for School Quality (ISQ) as well as Professor at the department for the Evaluation of School and Teaching Quality at the Freie Universität Berlin (Germany). After obtaining his doctorate in psychology, he initially worked at the ISQ as a project manager responsible for projects on the internal and external evaluation of schools. He conducts research on questions of data-based decision-making to support school and teaching development, including the impact of internal and external evaluation of schools.


# **Chapter 11 Student Feedback as a Source for Reflection in Practical Phases of Teacher Education**

**Kerstin Göbel, Corinne Wyss, Katharina Neuber, and Meike Raaflaub**

**Abstract** The chapter focuses on the use of student feedback on teaching during practical phases in teacher education. After a brief introduction to the general relevance and validity of students' perceptions of teaching and to the use of student feedback for teaching development, core findings from two comparable quasi-experimental studies from Germany and Switzerland are presented in detail. The studies focus on the change of attitudes towards student feedback and towards reflection on teaching. The chapter concludes with a discussion of challenges and opportunities for the use of student feedback as an instrument for reflection on teaching and professional development for pre-service teachers.

**Keywords** Teacher education · Reflection · Practical phases · Validity of student feedback · Quasi-experimental studies

# **1 The Relevance and Validity of Students' Perceptions**

Teaching in class is a complex situation, as teachers have to master many different tasks at the same time (Bromme, 2014; Stürmer et al., 2017). In this context, receiving feedback on their behaviour can be particularly helpful for teachers, as it expands their perspectives in a meaningful way and might give insights into the teaching process (Helmke, 2015). For teachers, and even more so for pre-service teachers, it is

K. Göbel (B) · K. Neuber
Faculty of Educational Sciences, University of Duisburg-Essen, Essen, Germany
e-mail: kerstin.goebel@uni-due.de

K. Neuber
e-mail: katharina.neuber@uni-due.de

C. Wyss
FHNW School of Education, Brugg, Switzerland
e-mail: corinne.wyss@fhnw.ch

M. Raaflaub
University of Teacher Education, Bern, Switzerland
e-mail: meike.raaflaub@phbern.ch

difficult to process relevant information during teaching in class. In order to counteract restricted and possibly self-serving perspectives, student feedback can offer a specific perspective which may hold further information on teaching and learning processes relevant in the classroom (Clausen, 2002; Clausen et al., 2020; Hascher et al., 2004).

The relevance of student perceptions of teaching is apparent from the very fact that students and their learning are the targets of teaching, and as such, students can refer to their experiences with different subjects and teachers. Hence, their observation of the teaching and learning process may contain highly relevant information for teachers. Concerning empirical results on student perceptions of teaching quality, studies in primary and secondary education reveal the factorial validity of student ratings: students are capable of differentiating between various aspects of teaching quality, such as classroom management, motivational quality, and teaching clarity (Fauth et al., 2014; Lenske, 2016; Wagner et al., 2013). Furthermore, several studies point to the predictive validity of student ratings, as student perceptions of teaching are linked to learning outcomes. Studies in mathematics reveal correlations of classroom management, goal clarity, and support for autonomy with students' mathematical learning and their self-concept or interest in mathematics (Clausen, 2002; Kunter et al., 2007; Wagner et al., 2016). A large-scale study on English learning in secondary schools shows a correlation of classroom climate, motivational quality, and clarity as perceived by students with the development of their listening comprehension in the course of one school year (Helmke et al., 2008). Moreover, intercultural learning outcomes in EFL (English as a Foreign Language) secondary classes could be predicted by students' perceptions of specific aspects of teaching quality, such as a positive error culture and classroom management (Göbel & Hesse, 2008). In some studies, the predictive validity of student ratings for learning outcomes is even higher than that of expert or teacher ratings (Fauth et al., 2014; Göllner et al., 2016; Wagner, 2008).

While there is empirical evidence for the predictive and factorial validity of student ratings of teaching, current studies also point to limitations when it comes to gathering information on teaching quality through student ratings. In a German interview study, 14 secondary school students were confronted with their ratings of teaching quality and asked to explain the reasons for their feedback on each of the rated items (Lenske & Praetorius, 2020). The interviews revealed that the students did not fully understand all items of the questionnaire, although it was an instrument which had been validated in former studies. Another study, by Röhl and Rollett (2020), examined data from a student survey administering student feedback questionnaires on teaching quality (*N* = 860). Their analyses of factorial validity point to halo effects of teachers' communion (community orientation) on different teaching quality ratings.

Although student feedback might be fraught with uncertainty due to problems of validity and reliability, it represents a special perspective on the teaching process and provides teachers with important orienting information on their teaching (Clausen & Göbel, 2020). Studies on in-service teachers' use of student feedback on teaching point to a positive impact on teaching development in terms of the teacher–student relationship and a more sophisticated view of the needs of students (Ditton & Arnold, 2004; Gärtner, 2013; Rösch, 2017). Analyses of in-service teachers using student feedback point to the relevance of teacher–student co-construction of the meaning of student feedback in class for a better understanding of students' ideas (Gärtner & Vogt, 2013). Furthermore, the positive effect of student feedback seems to depend on teachers' attitudes towards student feedback, their attitudes towards cooperation, their stress experience, and the quality of the student feedback (Gärtner, 2013; Ditton & Arnold, 2004).

In practical phases of teacher training, student feedback has the potential to bring about changes in the attitudes of future generations of teachers, so that they can use feedback (aware of the challenges and problems of this information) for continuous reflection and development of their teaching (Clausen & Göbel, 2020). Pre-service teachers can consider student feedback on teaching in addition to feedback from in-service teachers or lecturers during practical phases. However, the use of student feedback for learning and reflection processes during practical phases in teacher education is still rare (Hascher et al., 2004), and the students' perspective on pre-service teachers' teaching and professional development has scarcely been investigated empirically (Lawson et al., 2015). Therefore, the following sections seek to shed light on current research and findings in the field of student feedback in teacher education.

In the following, we present empirical results on the implementation of student feedback in teacher education. After giving an overview of international results on the topic, we present two comparable quasi-experimental studies from Germany and Switzerland which focus on the change of attitudes towards student feedback and towards reflection on teaching. The two studies are interconnected: they are similar in research design and use the same instruments to evaluate attitude changes in the course of student feedback use in practical phases of teacher education. The chapter concludes with a discussion of challenges and opportunities for the use of student feedback as an instrument for reflection on teaching and the professional development of pre-service teachers.

# **2 Empirical Results on Student Feedback for Reflection on Teaching in Teacher Education**

Although there are several indications of the relevance of student feedback for teaching improvement, as well as calls for its integration into teacher education, the number of empirical studies focusing on student feedback use in teacher education is still limited (Lawson et al., 2015). Work on student feedback in teacher education started in 1942, when Porter published a paper on an exploratory study on this topic. Analyses of a questionnaire focusing on characteristics of pre-service teachers revealed close agreement between the ratings of students and supervisors. Pre-service teachers evaluated the feedback of their students as beneficial, and their students reported that they appreciated being part of the evaluation process (Porter, 1942).

# *2.1 Systematic Settings and Measurement Problems*

In 1969, Lauroesch and colleagues investigated the use of student feedback by pre-service teachers at the University of Chicago to assess its impact on their teaching. The quality of the pre-service teachers' instructional practice during the internship was measured twice using student ratings. The findings of this quasi-experimental study indicate that the summary of student ratings provided may not be sufficient to encourage improvements in the pre-service teachers' teaching. At the second measurement, the teaching quality of those pre-service teachers who had received a summary of student ratings of their lesson was rated even less positively than before (Lauroesch et al., 1969). The authors conclude that the feedback was potentially misunderstood or that pre-service teachers were overburdened with using the feedback constructively and initiating changes in their teaching. These findings hint at the need for systematic settings for the reception and reflection of student feedback in order to provide pre-service teachers with concrete starting points for developing their teaching. Possibly, however, it is not a lack of development in teaching but a problem of measurement: for students, it might be difficult to assess changes in teaching quality. A study by Holtz and Gnambs (2017) points to the fact that student feedback can be problematic for assessing changes in instructional quality. They measured the teaching quality of 181 pre-service teachers in a 15-week internship at a secondary school in Thuringia (Germany) using three different rating sources (self-assessment, mentors' assessment, and student ratings). The findings indicate differences in change scores between the three rating sources: pre-service teachers themselves and their mentors perceived larger changes in instructional quality than students did. Similar findings were reported in a study by Biggs and Chopra (1979), where changes in teaching quality could not be detected by student ratings.

# *2.2 Constructive Feedback for Instructional Development*

In an exploratory study in France, Genoud (2006) implemented student feedback in teacher training focusing on classroom climate, using the TIP questionnaire (Trainee Interaction Profile; Wubbels & Levy, 1993). In a sample of approximately 50 pre-service teachers and their students from grades 5 and 6, the questionnaire was administered in order to reveal differences between the pre-service teachers' self-assessments and the assessments of their students and their training supervisor. The intervention was evaluated positively by the pre-service teachers and their students. Pre-service teachers reported a positive perspective on the use of student feedback on teaching for their professional development during initial teacher training.

A further exploratory study by Snead and Freiberg (2019) examined the use of Freiberg's Person-Centered Learning Assessment (PCLA; Freiberg, 1994–2017) for reflecting on and developing the instructional practice of 10 pre-service teachers in the United States. The pre-service teachers reported that changes in their teaching as a result of using the PCLA occurred mostly in areas of planned instructional change, such as engagement, levels and types of questioning, and teacher-to-student communication. Although the use of the PCLA has the potential to lead to deeper levels of self-reflection and changes in teaching, further qualitative analyses of pre-service teachers' reflections on the implementation of student feedback (a component of the PCLA) showed that the quality and quantity of student feedback was heterogeneous. The authors therefore propose that, in order to derive more relevant information, it would be helpful to teach students how to provide constructive feedback for instructional development.

A qualitative case study focusing on pre-service teachers' experiences with the use of feedback from different sources (teachers, faculty supervisor, peers and students in class) during their school internship was carried out by Tulgar (2019). The study examines written feedback reports from 28 pre-service teachers in Turkey. After using different sources of feedback, the participants reported development in different areas of their own professional competence, such as self-reflection, self-regulation by identifying strengths and weaknesses, evaluation of teaching performance, reflection on stress-related experiences and their planning of future lessons.

# *2.3 Summary*

The studies presented so far in this chapter reveal a positive attitude of pre-service teachers towards student feedback, and the respective students also seem to appreciate its use. Although different instruments were used, all of them appear to have a positive impact on pre-service teachers' professional development with respect to different areas of reflection on their professional actions. While student feedback is generally evaluated positively by pre-service teachers, the quality and quantity of student comments on lessons are perceived as heterogeneous. It is therefore not surprising that the measurement of change in teaching quality by means of student ratings is inconsistent and seems problematic. Moreover, none of the presented studies systematically varied the settings used to support reflection. In the following sections, two studies are presented in more detail, as they investigate the relevance of different reflection settings when using student feedback in teacher education.

# **3 Studies in Germany and Switzerland**

# *3.1 Concept and Main Findings of the ScRiPS-Study (Germany)*

#### **3.1.1 Introduction**

Positive attitudes and the willingness to engage in self-reflection are considered central competences in the teaching profession; thus, an open attitude towards reflection on one's own teaching and pedagogical actions should be promoted in teacher training (Svojanovsky, 2017). The ScRiPS-study (*Schülerrückmeldungen zum Unterricht und ihr Beitrag zur Unterrichtsreflexion im Praxissemester* / The use of student feedback for reflection upon teaching during the practical term) is an intervention study carried out at the University of Duisburg-Essen, North-Rhine Westphalia, Germany (Göbel & Neuber, 2017, 2019; Neuber & Göbel, 2019); it aims at supporting and analyzing reflection on teaching with the use of student feedback in teacher training. In North-Rhine Westphalia (Germany), the first phase of teacher education is provided by universities in a Bachelor–Master structure. This phase is mostly theoretical, addressing content knowledge, pedagogical knowledge and pedagogical content knowledge; furthermore, it integrates two practical terms. The first practical term is a 5-week internship in schools at the beginning of the Bachelor program. The second internship takes place at the beginning of the Master program and lasts around 5 months; its aim is to gain initial teaching experience, to reflect on practical experience and to link theoretical knowledge with practical experience. The second phase of teacher education is a mostly practical one, which takes place in schools and is guided by the centres for practical teacher training. The ScRiPS-study seeks to support and analyze the reflection of pre-service teachers during the 5-month practical phase of the Master program, as well as the reflection of in-service teachers in schools, when using student feedback. Changes in the attitudes of pre-service and in-service teachers towards reflection and student feedback were investigated.

#### **3.1.2 Method**

The study included 164 pre-service teachers (in the 5-month practical phase of the Master program, see above) from the University of Duisburg-Essen and 106 in-service teachers (Göbel & Neuber, 2020). The participants in the intervention groups were asked to implement student feedback on their teaching. The feedback instrument was a written form consisting of three open-ended questions about the quality of the lesson (What did you like about the last lesson? What did you not like about the last lesson? What could be improved for the next lesson?). Furthermore, standardized questionnaires with a focus on either classroom management (e.g. Gruehn, 2000), classroom climate (e.g. Rakoczy et al., 2005) or cognitive activation (e.g. Baumert et al., 2009) were provided to gather feedback from students. Both groups of teachers (pre-service and in-service) used the open-ended feedback questionnaire and could decide which of the standardized feedback questionnaires they wanted to use in addition. The received student feedback was evaluated by the pre-service and in-service teachers individually and then discussed with the students in class.

The in-service teachers implemented student feedback on their lessons but were not further supported in the reception and reflection of the feedback. For pre-service teachers, the use of student feedback was investigated in a quasi-experimental control-group design with three intervention groups (IG) (Göbel & Neuber, 2017; Neuber & Göbel, 2019). Pre-service teachers in intervention group 1 (*n*IG1 = 22) obtained student feedback on their lessons but did not receive further support for reflection. Pre-service teachers in intervention groups 2 (*n*IG2 = 32) and 3 (*n*IG3 = 33) received individual support for reflection in the form of a reflective journal entry developed in the ScRiPS-project. The reflective journal entry contains a catalogue of questions (prompts) intended to enable a deeper reflection on the feedback results (Hübner et al., 2007); the prompts refer to the lesson as well as to the results of the student feedback. The pre-service teachers in intervention group 3 additionally reflected on the student feedback in a collegial setting (peer reflection in tandems) at the university. To structure the collegial reflection setting, pre-service teachers could use the materials provided in the form of reflective questions and their reflective journal entries. The pre-service teachers in the control group used neither student feedback, nor a reflective journal, nor a collegial setting during their practical term. A total of 87 pre-service teachers were assigned to the intervention groups (use of student feedback and written or collegial reflection setting during the practical phase); 77 pre-service teachers were not assigned to any feedback-based reflection setting during the practical phase (control group).

The use of student feedback was empirically investigated with regard to changes in pre-service and in-service teachers' attitudes towards reflection upon teaching. The attitudes of pre-service and in-service teachers towards reflection and student feedback were measured before and after the student feedback intervention via standardized questionnaires. The scales regarding attitudes towards different forms of reflection (e.g. reflective journals or collegial settings) and towards the use of student feedback as a reflection stimulus were formed by averaging the respective questionnaire items and were shown to have acceptable reliability (Neuber & Göbel, 2018). All items were answered on 4-point Likert scales ranging from 1 ("I fully disagree") to 4 ("I fully agree"). Differences between groups and changes in attitudes were analyzed with unpaired and paired t-tests and with repeated measures ANOVA. In order to examine correlations between the pre-service teachers' attitudes and their motivational preconditions, their motivation to study (Kauper et al., 2012) as well as their stress experience (Schwarzer & Jerusalem, 1999) were measured via standardized questionnaires. Furthermore, within a sub-study of the ScRiPS-project, the personal experiences of the pre-service teachers with the use and reflection of student feedback on their own teaching were examined in interviews, which were analyzed using qualitative content analysis.
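To make the analysis strategy concrete, here is a minimal sketch of how scale scores could be formed by item averaging and how the reported group and time comparisons could be run in Python. This is not the ScRiPS analysis code: the file name and the column names (`participant`, `group`, `time`, `item_1` … `item_4`) are hypothetical.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one row per participant and measurement
# point, with four Likert items (1-4) belonging to one attitude scale.
df = pd.read_csv("attitude_survey.csv")

# Scale formation as described above: average the respective items.
items = ["item_1", "item_2", "item_3", "item_4"]
df["attitude"] = df[items].mean(axis=1)

# Paired t-test: change in attitudes from pre-test (T1) to post-test (T2).
pre = df[df["time"] == "T1"].set_index("participant")["attitude"]
post = df[df["time"] == "T2"].set_index("participant")["attitude"]
t_change, p_change = stats.ttest_rel(post, pre.loc[post.index])

# Unpaired t-test: pre-test difference between two groups of teachers.
g1 = df[(df["time"] == "T1") & (df["group"] == "pre-service")]["attitude"]
g2 = df[(df["time"] == "T1") & (df["group"] == "in-service")]["attitude"]
t_group, p_group = stats.ttest_ind(g1, g2)

# Repeated measures ANOVA with measurement point as within-subjects factor.
anova = AnovaRM(df, depvar="attitude", subject="participant", within=["time"]).fit()
print(anova.summary())
```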

#### **3.1.3 Results**

Looking at the results, pre-service teachers report fundamentally positive attitudes (mean *M* > 2.5 on the 4-point Likert scale) towards reflection on teaching and student feedback (Göbel & Neuber, 2017). In addition, a high acceptance of the use of student feedback as well as of written and collegial forms of reflection during the practical term can be shown (*M* > 2.5; Neuber & Göbel, 2020). The comparison of the intervention groups showed that the pre-service teachers who were systematically supported in the reception and reflection of the student feedback (intervention groups 2 and 3) assessed the use of student feedback slightly more positively (*M*IG2 = 3.29, *SD*IG2 = 0.41; *M*IG3 = 3.30, *SD*IG3 = 0.42) than pre-service teachers without written or collegial reflection support (*M*IG1 = 3.18; *SD*IG1 = 0.43). However, there are no significant differences between the intervention groups in the assessment of the use of student feedback (*p* = .521). Furthermore, pre-service teachers who reflected on their own teaching both individually and collegially (intervention group 3) assess the collegial form of reflection (*M* = 2.86; *SD* = 0.72) as slightly more helpful for reflecting on the student feedback than the written reflection sheet used individually (*M* = 2.78; *SD* = 0.62).

In a comparative sub-study, the attitudes of 53 pre-service and 51 in-service secondary school teachers were compared (Göbel & Neuber, 2020). In the pre-test survey, both pre-service (*M* = 3.24; *SD* = 0.36) and in-service teachers (*M* = 3.20; *SD* = 0.50) consider reflection on their own teaching to be important; the participants also have positive attitudes towards student feedback (*M* > 2.5). The two groups differ neither in the perceived relevance of reflection (*p* = .605) nor in their attitude towards student feedback (*p* = .196). The analysis indicates that pre-service teachers (*M* = 3.04; *SD* = 0.55) perceive structured reflection formats to be more helpful than in-service teachers do (*M* = 2.70; *SD* = 0.55; *p* = .002). The same is true for collegial reflection formats; again, the analysis indicates a significant difference between the attitudes of pre-service teachers (*M* = 3.42; *SD* = 0.42) and those of in-service teachers (*M* = 2.88, *SD* = 0.57; *p* < .001). Furthermore, pre-service teachers (*M* = 1.93; *SD* = 0.37) are more critical of individual reflection settings than in-service teachers (*M* = 2.29; *SD* = 0.54; *p* < .001), although both groups tend to reject individual forms of reflection (*M* < 2.5). After using student feedback on teaching, both pre-service teachers (*M*T1 = 3.32; *SD*T1 = 0.36; *M*T2 = 3.39; *SD*T2 = 0.42) and in-service teachers (*M*T1 = 3.20; *SD*T1 = 0.56; *M*T2 = 3.27; *SD*T2 = 0.56) showed a slight increase in positive attitudes towards student feedback (within-subjects effect of time *F*(1, 101) = 4.221, *p* = .043, η<sup>2</sup> = 0.040). After finishing the internship (*M*T2 = 3.34, *SD*T2 = 0.35), the perceived relevance of reflection slightly increases for pre-service teachers compared to before the internship (*M*T1 = 3.24, *SD*T1 = 0.36, *p* = .036). Moreover, pre-service teachers are more critical regarding the use of written structured forms of reflection after finishing the internship (*M*T1 = 3.04, *SD*T1 = 0.55; *M*T2 = 2.88, *SD*T2 = 0.61, *p* = .048). For in-service teachers, however, no statistically significant changes in attitudes towards reflection are apparent.

Further analyses indicate that the motivational preconditions of pre-service teachers are important for the use and reflection of student feedback (Göbel & Neuber, 2017). Accordingly, the analyses reveal a positive correlation between pre-service teachers' attitudes towards student feedback and their motivation to study (Pearson's *r* = .30, *p* = .008) as well as their positive stress experience (experience of challenge in the teaching profession; *r* = .40, *p* < .001). The findings of the qualitative sub-study on pre-service teachers' experiences indicate that, in addition to motivational preconditions, organizational aspects of the use of feedback (e.g. arrangements with participating teachers), time resources and characteristics of the students are also important for the yield of feedback use and reflection (Neuber & Göbel, 2020). Collegial opportunities for reflection are perceived as more helpful by pre-service teachers than individual forms of feedback reflection. In particular, the joint reflection on the feedback with the students is considered helpful by the pre-service teachers. However, pre-service teachers report differences between students of different grades in terms of their experiences with feedback and the information content of their feedback, which plays an important role in the yield of classroom reflection and thus in actual changes in teaching.

#### **3.1.4 Summary**

The findings of the ScRiPS-study show that both pre-service and in-service teachers hold positive attitudes towards the use of student feedback and towards reflection in general. The analyses for the pre-service teachers show that motivational preconditions are important for positive attitudes towards reflection. Additionally, time resources and characteristics of the student feedback seem relevant for the effective implementation of student feedback during practical phases. Collegial opportunities for reflection are perceived as more helpful by pre-service teachers than individual forms of feedback reflection; in comparison, in-service teachers also evaluate collegial reflection positively, but not to the same extent as pre-service teachers. In future analyses, differences in attitudinal changes between pre-service teachers who systematically used student feedback during the practical phases and those who did not (control group) will be examined.

# *3.2 Concept and Main Findings of the Study SelFreflex (Switzerland)*

#### **3.2.1 Introduction**

In Switzerland, the training of teachers is mostly provided by universities of teacher education and is organized in a Bachelor–Master structure. The training includes different disciplines and addresses content knowledge, pedagogical knowledge and pedagogical content knowledge. Special attention is paid to a practice-oriented curriculum that combines theory and practice by allowing students to gain practical experience from the very first semesters of study. In the practical phases, students have the opportunity to observe the teaching of in-service teachers and peers as well as to teach students in a classroom. These experiences are reflected on at the university in order to link practical experience with theoretical knowledge. In the project "Student feedback to promote teaching reflection" (*Schülerrückmeldungen zur Förderung der Unterrichtsreflexion*, SelF*reflex*), pre-service teachers at the Zurich University of Teacher Education in Switzerland gathered student feedback for reflection during their practical training. The intervention study was conducted with 235 pre-service teachers training for lower secondary education (grades 7–9). The project was integrated into a 7-week practical phase which usually takes place in the 6th of 9 semesters. Before participating in the project, the pre-service teachers had already completed four practical training phases: in the first year of study, two day placements and a block internship of 3 weeks' duration; in the second year, another block internship of 2 weeks' duration. The data were collected with two samples of pre-service teachers in 2017 (*n*2017 = 115) and 2018 (*n*2018 = 120). As a reference group, data from 20 in-service teachers were also collected.

#### **3.2.2 Method**

At the beginning of the semester, pre-service teachers were asked about their attitudes towards student feedback and towards reflection by means of an online questionnaire (pre-test). The pre-test survey and the other instruments used in the study were taken from the ScRiPS project (see above) and adapted for the SelF*reflex* project. After the pre-test, the pre-service teachers received an introduction to the opportunities and goals of working with student feedback and were given the assignment to gather feedback from their students. During the practical term, pre-service teachers received feedback about their lessons from their students at two points in time. They could choose from three pre-defined questionnaires on the following aspects of teaching quality: classroom climate, classroom management and cognitive activation (see Sect. 3.1.2). In addition to the feedback received from their classes, the pre-service teachers assessed their own lessons through self-evaluation. They were expected to discuss the comparison of the perspectives, and the consequences resulting from it, with their students.

A group of 100 pre-service teachers reflected on the findings from the student feedback with an individual reflective journal entry (see Sect. 3.1.2). The reflective journal guides pre-service teachers towards a systematic reflection on a lesson while taking the student feedback into account. The reflective journal entries of all students were collected and analysed by means of qualitative content analysis (Mayring, 2015). A group of 130 pre-service teachers first processed the student feedback together with a peer who had observed the respective lesson, by means of collegial reflection. This group of pre-service teachers completed the individual reflective journal entry after they had received and discussed the additional feedback from their peers. The feedback discussion was structured around the results of the student feedback, the pre-service teacher's self-evaluation and the peer evaluation.

After completing the practical phase, a post-test survey was conducted using an online questionnaire. Similar to the pre-test, the post-test survey focused on the attitudes towards student feedback and reflection. In addition, items on experiences with student feedback were added to the questionnaire. Differences between groups and over time were analysed by using unpaired and paired *t*-tests. The lower secondary students were likewise asked about their experiences in a final survey. A short questionnaire was used to obtain their ratings on the usefulness of student feedback and on noticeable changes in the classroom. With selected lower secondary students, as well as pre-service teachers, semi-structured interviews were additionally conducted at the end of the practical phase.

#### **3.2.3 Results**

Based on the pre- and post-test surveys of pre-service teachers (*N* = 235), it is apparent that pre-service teachers consider the engagement with student feedback to be very valuable, both before and after the practical phase. However, agreement in the post-test survey is significantly lower (*M*T1 = 3.29, *SD*T1 = 0.46; *M*T2 = 3.18, *SD*T2 = 0.46) than in the pre-test (*p* = .005). The relevance of reflection is also rated as high, with a significant difference between the pre-test and post-test surveys (*p* = .039): after finishing the internship (*M*T2 = 3.05, *SD*T2 = 0.45), the perceived relevance of reflection increases for pre-service teachers compared to before the internship (*M*T1 = 2.99, *SD*T1 = 0.49).

The pre-service teachers consider collegial reflection to be very helpful. In the pre-test survey, pre-service teachers rate the usefulness of peer reflection as high, with a mean of 3.16 (*SD*T1 = 0.49). Interestingly, there is a difference between male and female participants in this respect: female pre-service teachers hold more positive attitudes towards collegial reflection (*n* = 133, *M*T1 = 3.23, *SD*T1 = 0.50) than male pre-service teachers (*n* = 102, *M*T1 = 3.08, *SD*T1 = 0.47, *p* = .028). The pre-service teachers are generally open to sharing thoughts and information about their own teaching with others, rating their preference for individual reflection rather low (*M*T1 = 2.01, *SD*T1 = 0.50). However, the preference for individual reflection increases after the end of the internship (*M*T2 = 2.09, *SD*T2 = 0.54, *p* = .029).

The qualitative data show that although pre-service teachers who worked with a peer highly value peer discussions, the perceived usefulness depends on various factors, such as the composition of the peer pairing. Pre-service teachers report in the interviews that collegial reflection with a peer is only beneficial if the peer shares a similar attitude towards teaching. Analyses of the peer discussions also show that critical aspects of teaching are rarely addressed (Raaflaub et al., 2019). It appears that peer discussions serve above all to confirm the pre-service teacher's own, positive lesson reflection: in the discussion, the reflection partner serves primarily to mitigate potentially problematic aspects and to show solidarity with the pre-service teacher's problems.

In further analyses, it became clear that the usefulness of student feedback also depends on the class, especially with regard to school level. In the interviews, pre-service teachers report that the implementation of student feedback through questionnaires had differing outcomes depending on school level and grade. This impression is supported by the findings of the final survey of the lower secondary students. The results show that students at a higher school level (*N* = 1249, *M* = 3.19, *SD* = 0.84) consider it significantly more important to give their opinions on lessons to their teachers than students at a lower school level (*N* = 81, *M* = 2.99, *SD* = 0.92; *p* = .038). Students at a lower school level also seem to have greater difficulty in completing questionnaires as a feedback instrument (Wyss et al., 2019). It should be noted that the very different sample sizes may limit the interpretation of these results.

#### **3.2.4 Summary**

With respect to tangible results regarding the contribution of student feedback to the promotion of teaching reflection, the evaluation of the pre-service teachers' reflective journal entries shows that they predominantly evaluate their own lessons positively (Wyss et al., 2020). It is noticeable that they primarily mention aspects that can easily be observed from the outside and thus belong to the surface structure of the lesson; aspects that concern the deep structure of the lesson are rarely addressed. The pre-service teachers also report that the majority of the students perceive the lessons positively. When pre-service teachers were asked to compare the different perspectives, some mentioned that the perceptions were very similar, whereas others noticed differences. Where ratings coincided, they explain that they feel relieved that the majority of students adopted a positive attitude towards their lessons and that their self-perception was confirmed. Differences in perception are mainly attributed to different roles and interests and are thus perceived as inherent to teaching rather than as changeable features of lessons.

# **4 Discussion and Conclusions**

The reported studies reveal positive attitudes of pre-service teachers towards the use of student feedback. The results support the assumption that student feedback in teacher training may help to initiate reflection on teaching and support the professional development of pre-service teachers (Tulgar, 2019). Furthermore, the studies show that student feedback is evaluated positively by the students themselves (Porter, 1942) and may have a positive impact on teacher–student relationships (Genoud, 2006). However, pre-service teachers perceive student feedback as heterogeneous (Neuber & Göbel, 2020; Snead & Freiberg, 2019; Wyss et al., 2019), and it cannot yet be treated as a valid source for measuring change in teaching quality (Holtz & Gnambs, 2017; Lauroesch et al., 1969). Therefore, a need to develop students' feedback competence is articulated by different authors.

The studies on student feedback in teacher education discussed in the first two sections of this chapter mostly have an exploratory design and do not address the reflection process in an explicit way. In contrast, the ScRiPS-study and the SelF*reflex* study provide more information on different reflection settings and on the yield of student feedback for the teaching reflection of pre-service teachers. In both the German and the Swiss study, pre-service teachers report positive attitudes towards the use of student feedback and towards reflection on teaching in general (Göbel & Neuber, 2017). The use of student feedback itself as well as collegial and written reflection formats are also positively evaluated (Neuber & Göbel, 2020; Raaflaub et al., 2019). The implemented collegial reflection settings and reflective journal entries seem to support the reflection process. For an effective implementation of student feedback on teaching, it seems necessary that all participants (pre-service teachers and students) agree on the reflection formats to be used. Positive attitudes, motivation and volition of pre-service teachers are important for an effective implementation of student feedback. The results further point to the relevance of professional experience (in-service vs. pre-service teachers in ScRiPS) as well as gender (in SelF*reflex*): in the German sample, pre-service teachers show more positive attitudes towards collegial reflection formats than in-service teachers; in the Swiss sample, collegial reflection formats are more strongly preferred by female than by male pre-service teachers (Göbel & Neuber, 2020; Wyss et al., 2020).

Summing up the different findings, the use of student feedback in teacher education requires further investigation, including the development of feedback instruments for different classes and school levels, as well as concepts for reflection and adequate time resources. For developing pre-service teachers' reflection on student feedback, discussions between teachers and students about the feedback results seem particularly promising: in these discussions, open questions concerning the results can be clarified, alternative courses of action for teaching can be developed, and students may gain a feeling of participation and appreciation. It is important to consider that, in general, both pre-service teachers and their students may have little experience in giving and receiving feedback on teaching. Pre-service teachers should therefore be systematically trained and supported in the reception and reflection of student feedback, while students should be trained in using the survey instruments adequately to provide helpful feedback on teaching. In light of the possible limitations students face when giving feedback, the training of their feedback competence could be a focus for further research. For the future implementation of student feedback in teacher education, it is important to generate more evidence in order to better understand which personal prerequisites and which institutional conditions are important for a constructive use of student feedback. Furthermore, reflection on student feedback is unlikely to have an impact on classroom change without additional support, as insights gained from student feedback may not be directly translatable into teaching development. Therefore, further research is needed on different reflection concepts and settings in order to identify those conducive to the reflection process for pre-service teachers and their respective students.

# **References**


**Kerstin Göbel** is a Professor of Educational Research at the University of Duisburg-Essen (Germany). Previously she worked as a researcher and lecturer in the field of Educational Science at the University of Wuppertal and at the Leibniz Institute for Research and Information in Education (DIPF) in Frankfurt am Main (Germany). Her research focuses on instructional development, reflection in teacher education, intercultural and interlingual learning in classroom teaching, as well as acculturation and school engagement.

**Corinne Wyss** is a Professor at the FHNW School of Education and Holder of the Chair for Studies on Professional Practice and Professionalism (Switzerland). From 2006 until 2015 she was a lecturer, and from 2015 until 2019 Holder of the Chair of Teacher Education Research at the Zurich University of Teacher Education (Switzerland). Her work focuses on professionalization in the teaching profession, reflection and feedback processes, video-based classroom research and teacher training, eye tracking and augmented reality in teaching and learning.

**Katharina Neuber** studied educational sciences with a focus on empirical educational research and quality management, and has been working as a researcher (PhD) at the University of Duisburg-Essen (Germany) since 2014. Her research focuses on the use of student feedback, reflection on teaching, practical phases during teacher education, as well as teacher stress and well-being.

**Meike Raaflaub** is a former teacher in lower and higher secondary education. She currently works as a researcher and lecturer for teaching methodology at the University of Teacher Education in Bern (Switzerland). Her research focuses on teaching methodology for lower secondary education, student engagement, digitalization in the language learning classroom, and the use of student feedback on teaching.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the Chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the Chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 12 Reciprocal Student–Teacher Feedback: Effects on Perceived Quality of Cooperation and Teacher Health**

## **Jan-Erik Schmidt and Caterina Gawrilow**

**Abstract** High lesson quality in schools is, in addition to other factors, the result of good cooperation between teachers and students. The long history of research on offer-use models of lesson quality and student–teacher relationships documents this interaction. Feedback focused on expressing the quality of cooperation can lead to higher quality of cooperation. The fact that feedback is reciprocal, from teacher to student and vice versa, helps to avoid effects of perceived injustice and rejections of feedback which otherwise are severe obstacles to the efficient use of feedback. High-frequency applications of feedback allow for the timely detection of (positive and negative) critical fluctuations of cooperation between individuals and groups and for the monitoring of processes of adaptation, as shown in other areas of applied psychology. This chapter describes the theoretical parameters of such a feedback method for students and teachers, and outlines results of an empirical study on the effects of the reciprocal method on (1) perceived quality of cooperation and (2) teacher health. Results show that, subsequent to a three-month period of reciprocal feedback, the quality of cooperation as perceived by both students and their teachers increases significantly and teacher health scores improve significantly. Reciprocal feedback techniques should be considered in teacher education and teacher training as a way to help teachers to initiate processes of improvement of lesson quality.

**Keywords** Feedback · Student–teacher interaction · Cooperation · Co-regulation · RCT

J.-E. Schmidt (B)
Center for School-Quality and Teacher Education, Tübingen, Germany
e-mail: jan-erik.schmidt@zsl.kv.bwl.de

C. Gawrilow
Department of Psychology, University of Tübingen, Tübingen, Germany
e-mail: caterina.gawrilow@uni-tuebingen.de

# **1 Introduction**

One of the most general definitions of feedback states that feedback is "information about the gap between the actual level and the reference level of a system" (Ramaprasad, 1983, p. 4). Going beyond this general definition, there is a need for a structured overview of the many different forms of feedback which have been suggested for fostering development in schools. For this, the five main characteristics of each kind of feedback should be made clear: (a) the source of feedback, (b) the recipient, (c) the topic, (d) the method, and (e) the frequency of the feedback (Hattie & Wollenschläger, 2014; Kluger & DeNisi, 1996; Mikula et al., 1990).

The most common form of feedback in the educational field is a teacher providing feedback to a student about academic results or about their techniques of problem-solving and self-regulation in school (Hattie & Timperley, 2007). Giving feedback in the other direction—from students to teachers—seems to be an important aspect too (Hattie, 2009). Furthermore, providing feedback in both directions simultaneously could be even more powerful, because teacher and student are both actors in school learning and both benefit from information about the transaction they create. The questions of (a) what contents and topics teachers and students should receive feedback on and (b) what kind of information is reliable are matters of intense research and are addressed in several chapters of this book, for example, Chap. 3 (Röhl and Rollett), Chap. 4 (Bijlsma), Chap. 5 (van der Lans), and Chap. 7 (Göllner et al.). This chapter refers to one specific feedback topic—the quality of cooperation between teachers and students as a class—which provides information about the views of students about their teacher and vice versa. More specifically, we also refer to a certain type of feedback—reciprocal feedback—where teacher and students send and receive feedback at the same time. Thus, the interaction between teacher and students and the dynamics of this interaction can be addressed via a feedback process. The rationale and theoretical background of this kind of feedback are explained in the first part of the chapter, followed by a description of the research method used in our study. The third part presents the results of a first empirical study on the effects of this form of reciprocal feedback.

# *1.1 Feedback Frequency*

A question which—to the best of our knowledge—has so far drawn little attention from researchers is: How frequently should reciprocal feedback be provided in order to trigger practical consequences? Some research on frequency has been done in occupational settings (Ilgen et al., 1979; Kluger & DeNisi, 1996; Kuvaas et al., 2017; Park et al., 2019) and on feedback from teachers to students (Guo & Wei, 2019; Pinter et al., 2015; Tamara et al., 2004). Strong support for the use of high-frequency feedback has also been documented in the field of psychotherapy (Schiepek et al., 2016). However, there are no empirical studies addressing the effectiveness of feedback frequency in the student-to-teacher direction. Furthermore, although dyadic regulation processes have already been investigated in some areas of psychology, there is no such research on the association of self-regulation and dyadic regulation processes in classroom scenarios.

The feedback introduced in this chapter is applied weekly and has been shown to be easily manageable (Schmidt, 2018). A higher frequency—e.g., daily—may be even more effective, as primacy and recency effects would be reduced. On the other hand, this would also be more difficult to realize. Weekly application thus seems a good compromise to foster co-regulation processes in the classroom between students and teachers—frequent and timely enough to be both effective and still manageable.

# *1.2 Interpersonal Facets of Feedback*

Interpersonal facets of feedback such as "credibility" and "sender intentions", as perceived by the recipient, play an important role in the acceptance and use of feedback (Umlauft & Dalbert, 2012). These interpersonal facets can determine whether feedback information is well received and elaborated upon or is rejected. Important characteristics of persons giving feedback are being perceived as legitimate and credible, and their motivation and intention to support the person receiving feedback. Feedback givers must also display the ability to interact in a friendly manner so that feedback information is likely to be elaborated upon. Depending on the recipient's self-esteem and appraisal strategies, feedback carries the risk of causing negative emotions and outcomes such as lowered self-esteem and reduced effort (Leary & Terry, 2012). Feedback can be perceived as unjust, and such a perception of injustice causes a variety of unwanted results, including (1) rejection of the feedback, (2) feelings of being excluded from a group, and (3) increased delinquent behavior (Mikula et al., 1990; Umlauft & Dalbert, 2012). All findings reported above concern feedback given by instructors to their students. These risks can, however, potentially be reduced if students are involved and asked to give feedback from their perspective. When students are asked to provide feedback on the cooperation with their teacher, they are implicitly addressed as competent professional partners and thus highly validated. Additionally, as both teachers and students are asked to provide feedback, it is implicitly acknowledged that the views of students and teachers can differ without one being wrong or right, and that both views must be considered. Thus, perceptions of injustice can be avoided. We therefore see strong reasons for a reciprocal construction of feedback on aspects of lesson quality.

# *1.3 Cooperation—A Basic Ingredient for Lesson Quality*

Cooperation between teachers and students addresses the fundamental character of lesson quality as a transactional phenomenon: both teachers and students have to contribute certain activities to create a lesson. Evidence implies that feedback from students to teachers concerning student perceptions of lesson quality can contribute to teaching effectiveness (Bill & Melinda Gates Foundation, 2012; Helmke et al., 2009; Pianta et al., 2008; Raudenbush & Jean, 2015). What is still missing in this body of research is a focus on the transactional and complex character of teaching and learning (Brophy & Good, 1984; Pianta & Hamre, 2009; Pianta et al., 2003). Helmke introduces his *Angebots-Nutzungs-Modell* (Offer-Use Model of Lesson Quality) by stating that "Good lessons are a coproduction between teachers and students" (Helmke, 2007, p. 63), suggesting that lesson quality is the result of an offer made by the teacher as well as of the acceptance and use of this offer by the students. Moreover, subsequent offers by the teacher are influenced by the use students have made of former offers. This view of teaching processes is characterized by reciprocity, irreversibility, and non-linearity, which are characteristics of living systems (Orsucci, 2006; Schiepek, 2009).

In other areas of psychology (as compared to educational and school psychology), the focus has shifted from mere (self-)regulation to dyadic (co-)regulation processes. Importantly, in social and health psychology, the strong claim is made that being accepted by a group and being part of a group leads to better health and a longer life (Forgas & Fiedler, 2020). Indeed, there is consistent empirical evidence that social and group relationships are protective factors for psychological and physiological health: individuals lacking social ties are physically and mentally less healthy and more likely to die prematurely than socially integrated individuals (House et al., 1988). Transferring these findings on social relationships to educational and school psychology suggests that students and teachers who (a) work together in a cooperative and friendly manner, (b) have a productive feedback culture, and (c) feel part of the social group within the classroom and/or school should report better well-being and perhaps also better academic results. However, despite results describing a good teacher–child relationship as a predictive factor for favorable short- and long-term outcomes in students (Hamre & Pianta, 2006), there is little research on co-regulation processes in educational contexts.

#### **1.3.1 Cooperation and Student–Teacher Interaction**

The importance of student–teacher interaction for teaching and learning has been shown across many dimensions (Hamre & Pianta, 2006; Seidel & Shavelson, 2007; Verschueren & Koomen, 2012). Students who report better relationships with their teacher have higher academic success as well as better social and emotional competences; "at risk" students in particular benefit from a good student–teacher relationship (Baker, 1999; Birch & Ladd, 1998; Cheon & Reeve, 2015; Eccles & Roeser, 2011; Jennings & Greenberg, 2009; Raufelder et al., 2016; Wentzel, 2009). Specifically, student–teacher relationships include the domains of (1) organizational support, (2) academic support, and (3) social support (Eccles & Roeser, 2011; Pianta & Hamre, 2009). From a transactional point of view, such interactional support can be seen as the result of a successful process of cooperation between the teacher and their students. According to Axelrod (1984), cooperation is the willingness to abstain from maximum personal gain in favor of a common good, including the willingness to seek compromises. The common good in this case can be defined as lesson quality, in which both students and teachers have a long-term interest. Contributions to lesson quality by students are behaviors such as (1) taking out their books promptly, (2) working silently in order not to disturb others, or (3) raising a question when feeling blocked. These behaviors are potentially hindered by students' short-term interests in more personal gains. Examples of short-term interests which may override interest in the common good of lesson quality are (1) making contact with a classmate, (2) taking a rest, (3) avoiding being judged by others when asking questions, or (4) low impulse control, such as seeking acknowledgment for a good joke. The concept of *cooperation* offers two further benefits. Firstly, neither students nor teachers feel personally judged, since cooperation addresses an interpersonal rather than an intra-individual facet of lesson quality. Feelings of humiliation are thus avoided, which helps to prevent withdrawal or even revenge (Furman & Ahola, 2006); moreover, viewing both teachers and students as contributors to classroom success serves students' need for justice as described above. Secondly, asking students how they evaluate the cooperation between themselves and their teacher implicitly conveys the message that teachers see students as capable of contributing and see their contribution as important, which supports students' needs for self-efficacy and self-determination (Ryan & Deci, 2009). Feedback which focuses on the cooperation between students and their teacher should help the contributors reflect on their cooperation and improve it in a threefold manner: (a) by helping students to bring up ideas for the improvement of lesson quality which are relevant from their perspective, (b) by creating a situation in which teachers can learn how students perceive lessons, tasks, and explanations and thereby gain insight into the effects of their teaching, and (c) by improving social support when listening to each other and implementing ideas developed together.

# **2 A Feedback Technique for Iterative Feedback About Student–Teacher Cooperation**

In order to implement reciprocal feedback as described above, we developed and tested a method which focuses on the quality of cooperation between a teacher and their class as perceived by both parties. Students and teachers give their feedback weekly at the end of the last school lesson. To do so, the students answer the core question of the feedback technique, "How do you evaluate the cooperation between you as a class and your teacher during the last week?", by throwing a coin into a box with five labeled compartments (*very good, rather good, average, rather poor,* and *very poor*) for the possible answers. The teacher answers the equivalent question, "How do you judge the cooperation between you and your class during the last week?", by throwing a differently colored coin into the box. The results of each feedback session—the distribution of the students' answers and the teacher's answer—are displayed on a classroom poster at the beginning of the first lesson of the following week, and the results of all weeks remain visible during the whole feedback period. Teachers and students are invited to discuss the results of the feedback each week following a solution-focused protocol in which the teachers have been trained. Thereby, the classes are guided to discuss: characteristics of weeks with a higher quality of cooperation ("Why did you assess this week as having better cooperation than this other one?"); which teacher activities and which student activities contributed to good lesson quality ("What did I do to help us cooperate in this week? What did you do?"); and what each side could do to further contribute to lesson quality ("What can I do to improve our cooperation? What could you do to improve our cooperation?").
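As a small illustration of the bookkeeping behind this technique, the following sketch tallies one week's coins into the five answer categories and derives the summary shown on the classroom poster. The numeric coding of the categories and all names in the code are our own assumptions for illustration, not part of the published method.

```python
from collections import Counter

# Answer categories of the feedback box; the 5-to-1 numeric coding is an
# assumption made here to allow a simple weekly mean.
CATEGORIES = {"very good": 5, "rather good": 4, "average": 3,
              "rather poor": 2, "very poor": 1}

def summarize_week(student_coins, teacher_coin):
    """Tally one week's reciprocal feedback.

    student_coins: list of category labels, one per student coin.
    teacher_coin: the teacher's single category label.
    """
    distribution = Counter(student_coins)
    mean_score = sum(CATEGORIES[c] for c in student_coins) / len(student_coins)
    return {"distribution": dict(distribution),
            "student_mean": round(mean_score, 2),
            "teacher_rating": teacher_coin}

# Example week: 24 student coins plus the teacher's coin.
week = summarize_week(
    ["rather good"] * 10 + ["average"] * 8 + ["very good"] * 4 + ["rather poor"] * 2,
    "rather good",
)
print(week)
```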

# **3 Own Empirical Study**

A first controlled trial was conducted in the field of teacher health. We specifically investigated the effects of the reciprocal feedback method on teacher health. The rationale was that the reciprocal technique could help teachers take the transactional character of lesson quality more into account by using the information students provide, which would in turn foster cooperative activities in which the students can participate. The first research question was: Does teacher health improve during or after the feedback period?

An appropriate application of the feedback method requires that teachers share its underlying idea: that the quality of cooperation is a core ingredient of good lesson quality and that students can contribute important information to the improvement of cooperation. Therefore, we also measured teachers' *Resource Orientation* with respect to their students. Resource Orientation is the assumption that students have the ability to assess lesson quality and to develop ideas for the improvement of cooperation. Our hypothesis was that the experience of iterative feedback on cooperation should lead to a higher Resource Orientation among the teachers through the experience of better cooperation, and thus reduce the occupational stress which arises when teachers try to manage the class by relying primarily on their own activities. The second research question was: Does the perceived quality of cooperation, as assessed by the students and by the teachers, improve during the three-month feedback period?

# *3.1 Procedure*

The sample consisted of 45 teachers from southern German mid-level schools and one of their classes each, from grades 6 to 9 (1022 students in total).

Each of the 45 teachers chose one of their classes in which they taught at least three lessons a week and asked the students to participate in the study. Teachers were randomly assigned to a treatment group (*n* = 23) or a waiting control group (*n* = 22). Resource Orientation and teacher health were assessed in both groups at three points in time (*T*0, *T*1, *T*2), with 12-week intervals between the time points. After students and their parents gave written consent, the first measurement (*T*0) took place. Subsequently, teachers in the treatment group received a one-day training session on the feedback method and a group supervision session after four weeks. Teachers in the waiting control group received their training after *T*2. Immediately after the training, teachers and students in the treatment group applied the reciprocal feedback technique in their classes once a week for a consecutive period of 10 weeks. The supervision sessions during the feedback period were held in order to support the teachers' use of the student feedback, helping them to understand the students' needs and to lead solution-focused class talks, so that specific actions in the classroom could be derived from the feedback. For a more detailed description of the process of recruitment, random assignment, and data analysis, see Schmidt (2018).

# *3.2 Measures*

Teacher health was assessed with the General Health Questionnaire (GHQ-12; Goldberg, 1992). The GHQ-12 is a widely used screening instrument for detecting mental health problems. It assesses the inability to carry out one's normal healthy functions and the appearance of new phenomena of a distressing nature. The GHQ-12 asks about mental health issues during the last two weeks in comparison to the participant's usual status. The questions include, for example, "Have you recently been feeling sad and gloomy?" Answers are coded on a four-point scale labeled, for example, less than usual, no more than usual, rather more than usual, and much more than usual. Higher values indicate a higher problem level. The internal consistency of the GHQ-12 has been reported in a range of studies, with Cronbach's alpha values between .77 and .93.

To examine teachers' Resource Orientation, a scale called the Resource Orientation Scale (ROS) was developed. The ROS consists of 12 items asking teachers to what extent they agree that (a) students are able to assess teacher–class cooperation and lesson quality (e.g., "My students can assess if they receive good individual support"), (b) students have useful ideas for the improvement of teacher–class cooperation and lesson quality ("My students have good ideas about what kind of support they need"), and (c) the teacher actually uses the knowledge of students to improve lesson quality ("I use students' ideas on how to make tasks activating"). Answers were given on a four-point scale ranging from 1 (not true) to 4 (true). The measure's internal consistency was acceptable across time, with Cronbach's alpha ranging from α = .82 at *T*0 to α = .87 at *T*1 and α = .89 at *T*2.
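For readers who want to reproduce reliability figures of this kind, the following is a minimal sketch of Cronbach's alpha, computed as k/(k−1) times one minus the ratio of the summed item variances to the variance of the total scores; the randomly generated responses are placeholders, not the actual ROS data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a 2-D array (rows = respondents, columns = items)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Placeholder data: 45 teachers answering the 12 ROS items on a 1-4 scale.
rng = np.random.default_rng(0)
ros_items = rng.integers(1, 5, size=(45, 12))
print(round(cronbach_alpha(ros_items), 2))
```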

The perceived quality of cooperation was assessed by comparing the feedback of students and teachers at the beginning of the feedback process (*T*1) and at the end of the process (*T*2). To this end, the results of the first three weeks and of the last three weeks of the period were averaged.

# *3.3 Results*

To assess the effects of the training, the treatment and control groups were compared with respect to changes in the outcome variables from *T*0 to *T*1 and from *T*0 to *T*2, using regression analysis (Table 1). To this end, the outcome variables were *z*-standardized to the *T*0 means. Teachers' Resource Orientation increased significantly from *T*0 to *T*1, and teachers' health problem scores decreased significantly from *T*0 to *T*2, as reported in Tables 2 and 3, respectively. The patterns of change in the Resource Orientation Scale (ROS) and teacher health (GHQ-12) scores in the treatment and control groups over all three measurement points are displayed in Fig. 1.
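The following sketch indicates how such an analysis could look in code, assuming a hypothetical wide-format data set with one row per teacher: the outcomes are z-standardized to the *T*0 mean and SD, and the treatment effect is then estimated by regression. The file and column names are illustrative, not those of the study.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical wide-format data: one row per teacher, with ROS and GHQ-12
# scores at T0/T1/T2 and a treatment indicator (1 = feedback group).
df = pd.read_csv("teacher_outcomes.csv")

# z-standardize each outcome to its T0 mean and SD, as described above.
for prefix in ["ros", "ghq"]:
    m, s = df[f"{prefix}_t0"].mean(), df[f"{prefix}_t0"].std()
    for t in ["t0", "t1", "t2"]:
        df[f"{prefix}_{t}_z"] = (df[f"{prefix}_{t}"] - m) / s

# Treatment effect on Resource Orientation at T1, controlling for baseline.
model = smf.ols("ros_t1_z ~ treatment + ros_t0_z", data=df).fit()
print(model.summary())
```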

To assess changes in the perceived quality of cooperation, *t*-tests for dependent samples were applied. The perceived quality of cooperation as assessed by teachers and by students increased significantly during the three-month feedback period with


**Table 1** Unstandardized scores for resource orientation and teacher health outcomes at all measurement points

*Note* ROS = Resource Orientation Scale; GHQ-12 = General Health Questionnaire


**Table 2** Regression analysis: treatment effects at *T*1

*Note* ROS = Resource Orientation Scale; GHQ-12 = General Health Questionnaire \* <.05, \*\*\*<.001



**Table 3** Regression analysis: treatment effects at *T*2

*Note* ROS = Resource Orientation Scale; GHQ-12 = General Health Questionnaire \* <.05, \*\*\*<.001

*t*(16) = 4.24; *p* = .001; *d* = 1.12 for the students' view and *t*(15) = 3.90; *p* = .001; *d* = 1.30 for the teachers' view. Descriptive results for all classes of the treatment group can be seen in Fig. 2.
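As an illustration of the statistics reported here, the sketch below computes a dependent-samples *t*-test together with a Cohen's *d* based on the standard deviation of the difference scores (one of several common *d* variants). The class means are invented placeholders, not the study data.

```python
import numpy as np
from scipy import stats

def paired_t_and_cohens_d(pre, post):
    """Dependent-samples t-test plus Cohen's d for the mean difference."""
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    t, p = stats.ttest_rel(post, pre)
    diff = post - pre
    d = diff.mean() / diff.std(ddof=1)  # d based on the SD of differences
    return t, p, d

# Placeholder values: average cooperation ratings of the first (M1) and
# last (M2) three feedback weeks for 17 classes of the treatment group.
m1 = [3.1, 2.8, 3.4, 3.0, 2.9, 3.2, 2.7, 3.3, 3.0,
      2.6, 3.1, 2.9, 3.5, 2.8, 3.0, 3.2, 2.9]
m2 = [3.5, 3.2, 3.6, 3.4, 3.1, 3.6, 3.0, 3.5, 3.4,
      3.0, 3.4, 3.3, 3.8, 3.1, 3.3, 3.6, 3.2]
t, p, d = paired_t_and_cohens_d(m1, m2)
print(f"t({len(m1) - 1}) = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```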

# **4 Discussion**

Choosing the quality of cooperation as the topic of feedback between students and their teachers, and applying reciprocal feedback repeatedly at a weekly frequency, seems to be a promising approach for initiating improvements in lesson quality. Improvement in the perceived quality of cooperation from both the students' and the teacher's point of view has been shown. Moreover, providing feedback about the perceived quality of cooperation to classes and inviting students to discuss cooperation in order to facilitate high lesson quality yielded improvements in teacher health. Furthermore, using such feedback for discussions between students and their teacher addresses a core process of lesson quality, since it fosters the effective use of feedback by addressing teachers and students in their roles as cooperative partners.

**Fig. 1** Resource Orientation Scores (ROS) and teacher health (GHQ-12) scores in treatment and control group

This empirical evidence should encourage further research, as the study has several limitations. Firstly, the choice of classes was made deliberately by the teachers. Teachers pointed out that they chose classes in which (a) improvement of cooperation between teacher and class was needed from their perspective and (b) they were confident that the group of students would be capable of using the method effectively in terms of the social relations among the students. Conflicts and a poor social climate among the students may be an obstacle to such feedback or might have to be addressed first. Secondly, future research should investigate the effects of the suggested kind of feedback on lesson quality measures other than the perceived

**Fig. 2** Perceived quality of cooperation in each class (*Note M*1 = average of first three weeks of feedback; *M*2 = average of last three weeks of feedback. Due to technical barriers, not all classes provided data for all weeks of the feedback)

quality of cooperation; such measures could include time on task, cognitive activation, or emotional support, as assessed by students or external observers. We would tentatively suggest that improvements in those measures may well result from improvements in the cooperation between teachers and their students. Thirdly, the applied feedback technique includes several characteristics which should be investigated further. For example, the high-frequency application of feedback could be tested with other feedback topics or methods. As things can develop fast in living systems, real-time data concerning the state of a system are crucial for understanding and adapting to particular situations. The reciprocal approach—inviting students and teachers to give feedback at the same time on the same topic—can be applied to other feedback topics. Lastly, the hypothesis that the type of feedback studied here fosters student–teacher relationships should be investigated more thoroughly.

In addition, further studies are needed which examine long-term effects of the regular use of iterative and reciprocal feedback on student–teacher relationships, teacher health, and students' academic results. Moreover, the idea that students can be viewed as partners in cooperation to improve lesson quality and that they can provide useful information to the process of cooperation should play a role in teacher education and teacher training—here teachers would develop an attitude and learn techniques to continuously strive for high lesson quality.

# **References**




**Jan-Erik Schmidt** worked in a residential youth home and in a family counseling center, and has been a school psychologist at the Center for School-Quality and Teacher Education in Tübingen (Germany) since 2008. He received his doctoral degree from the University of Tübingen in 2018. His areas of interest are patterns of cooperation in educational settings and change management in complex systems. Jan-Erik is a solution-focused family therapist and a board member of the European School Psychology Center for Training.

**Caterina Gawrilow** has been Full Professor of School Psychology at the Eberhard Karls University Tübingen (Germany) since 2013. She studied psychology at the Philipps University Marburg (1997–2002) and received her Ph.D. in 2005 from the University of Konstanz. Caterina Gawrilow also worked as a postdoc, including at New York University and the University of Hamburg. Until 2013, she was an Assistant Professor at the Leibniz Institute for Research and Information in Education (DIPF) and the Goethe University Frankfurt. Her research focuses on cognitive, emotional, social, motivational, and neuronal correlates of ADHD, and on self-regulation and self-regulation interventions—in experimental and intensive longitudinal designs.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Part III Relating to Other Fields of Research**

# **Chapter 13 Student Voice and Student Feedback: How Critical Pragmatism Can Reframe Research and Practice**

**Mari-Ana Jones and Valerie Hall**

**Abstract** This chapter recognises the diverse definitions and practices of student feedback, focussing on how student feedback can facilitate dialogue and thus contribute to the development of schools as democratic communities. Student feedback is thus positioned as a part of student voice, which has its roots in the United Nations Convention on the Rights of the Child (UNICEF, 1989). We question the ways in which schools elicit the views of students and how students' opinions are used, recognising the complexities arising from power relationships (Hart, 1992), the consumerisation of education (Whitty & Wisby, 2007) and the pressures of accountability. Furthermore, we consider ways in which researchers can address difficulties in the research-practice relationship (Chapman & Ainscow, 2019) and facilitate co-creation of research. We propose the perspective of critical pragmatism as a means to acknowledge the complexities of practice, whilst also highlighting the importance of critical reflection and dialogue. Critical pragmatism could move us from a "deconstructive scepticism toward a reconstructive imagination" (Forester, 2012, p. 6) in which schools and researchers collaborate to enable contextually rich practices of student feedback and student voice.

**Keywords** Student feedback · Student voice · Critical pragmatism · Dialogue · Reflection · Collaboration

# **1 Introduction**

We recognise that there is a fundamental belief in the need for schools to provide safe environments in which students can speak, and for student feedback to be used to implement change (Defur & Korinek, 2010). After all, "the first claim of the school is that of its pupils for whose welfare the school exists" (Stenhouse, 1983, p. 153). There is, however, much discussion about what we mean when we talk about "student feedback". We have chosen to accept the premise that student feedback can be defined as "the use of formal processes to gather information from students about their perceptions of teacher practices, teacher effectiveness and the quality of educational programmes" (Mandouit, 2018, p. 756).

However, the vocabulary used around student feedback has become increasingly diverse, with concepts holding different meanings for those involved (Forrest et al., 2007). The context within which such feedback is situated varies enormously: the cultural and environmental influences; the methods and practices used to elicit such feedback; policy and regulatory frameworks; the students and staff, and their relationships; and the purposes for which such feedback is sought. The overall intent may be about improvement, but the drivers come from a broad spectrum of need: from a performativity perspective that can demonstrate accountability and effectiveness (Verhaeghe et al., 2010); to opening up a "dialogue around teaching and learning in the classroom … [that could give] … teachers insights into the unique challenges experienced by their students" (Mandouit, 2018, p. 755). In this latter form, student feedback can be identified as a form of student voice, with schools aiming to serve as democratic environments in which structures can be created that enable students, teachers and the broader school family to have "meaningful involvement in decision-making processes" (Defur & Korinek, 2010, p. 19) and for teachers' classroom practice to be improved (Bourke & Loveridge, 2016; Mitra, 2008). Such opportunities for participation encourage the development of a student's sense of agency and self-worth, a sense of belonging, and reflection on past, present and future relationships (Thompson, 2005).

Student voice and student feedback derive from different agendas. The expansion of interest in student voice can be traced to Article 12 in the United Nations Convention on the Rights of the Child (UNICEF, 1989), which states that children have the right to be heard. Student feedback—mainly developed in higher education institutions and intended as a quality assurance measure (Harvey, 2003)—has been used to gather views for a specific purpose. As such, their foundations are somewhat different, but there are important interconnections. At their best, they enable a collaborative dialogue and the development of consultation across all stakeholders (Nelson, 2015). At their worst, they become instrumentalist in demonstrating compliance (Charteris & Smardon, 2019), or tokenistic in positioning students as consumers of education (Hall, 2020). We need to consider whether students are being engaged as "insiders" or "outsiders" (Forrest et al., 2007, p. 26): are they informing practice from within—through collaboration and agency; or is their purpose only to help fulfil the requirements of accountability frameworks?

Within this chapter, our focus is on student voice, but we acknowledge that this concept is also evidenced within student feedback and that there are many interlinking practices and connections between the two. We are thus interested in the ways in which researchers are in a position to support schools to critically explore how school communities espouse, enact and experience student voice (Hall, 2020). This chapter offers a critically pragmatic perspective that has the potential to enable student voice research to recognise the aspirations of student voice whilst not losing sight of the realities of school life. Schools can be enabled to reclaim student voice. They can value their own local knowledge and experiences and their own contexts, with the "thoughtful and serious consideration of student voice" (Keddie, 2015, p. 227) having the potential to yield considerable benefits. In doing so, opportunities emerge to develop contextually relevant practices that are: enriching for students and teachers alike (Bragg & Manchester, 2012; Fleming, 2015); that take into account the diversity of concepts and contexts (Mandouit, 2018; Verhaeghe et al., 2010); and that consider the ways in which discourses interconnect and overlap within student feedback and student voice (Charteris & Smardon, 2019). The discussion within this chapter, therefore, considers ways in which we as researchers may *begin* to facilitate "co-creation" of research; how we *mediate* and *broker* knowledge through "engaging in the identification and formulation of knowledge needs" (Wollscheid et al., 2019, p. 289).

# **2 Situating the Chapter**

Within the broader context of student voice, and by association student feedback, there are many rich discussions taking place across an international arena. Central to much of this is the acknowledgement that there are difficulties in accommodating national policies and competing priorities, differences in school contexts, and differing views on pedagogical approaches, as evidenced by research emerging from countries that are working on collaborative European projects (Bron et al., 2018; Holcar Brunauer, 2019). We need, therefore, to appreciate the constraints and challenges imposed on schools endeavouring to meet requirements, wherever they are situated. As demonstrated by research from New Zealand (Bourke & Loveridge, 2016, p. 59), this also needs to recognise that sometimes the focus is on "what can be changed, and not what confronts practices especially if the student feedback is challenging". The discussion within this chapter thus aligns with themes across this wider debate and seeks to broaden perceptions of student voice research and practice, highlighting some of the key drivers that influence, and sometimes hinder, the development of a more critically pragmatic approach: a "philosophy for professionals" (Ulrich, 2007, p. 1112).

To make improvements in student outcomes we know that it makes sense to go straight to the source as students can not only share opinions about their classroom experiences, but also play a significant role in school improvement efforts. But how do we best involve students in school decisions that will shape their lives and the lives of their peers? (Mitra, 2008, p. 20).

There are, however, concerns about the methods used to elicit student feedback (surveys, questionnaires, evaluation results). These relate not only to the validity of their construction and the questions asked, but also to the ways in which any results are interpreted (Darwin, 2016), for "It is not just the collection of data that is important, but the value that is placed on student evaluations" (Blair & Noel, 2014, p. 881). Institutions frequently find themselves operating between two conflicting objectives, "one which is focused on directives that accord success for meeting targets, and the other based on aspirations to enhance the community by allowing each student the possibility to be heard" (Shuttle, 2007, p. 33). A methodological quest for "authentic" student responses should be treated with caution (Spyrou, 2016), and Nelson (2015, p. 5) argues that a notion of authentic student voice "masks how power relations operate" in the production of student voice. Consideration needs to be given to *who* is assigning *value* and *worth* to such dialogue, and the emergent data, and how *equal* the potential is for all individuals to be involved (DeFur & Korinek, 2010).

Hart (1992) recognised that there would be issues of power and participation when adults in such settings attempted to work in partnership with children. His much vaunted "ladder of participation"—moving from levels of non-engagement (manipulative and tokenistic) through to levels of engagement with evidence of growing consultation, agency and the development of shared power (with the potential for transformation)—has acted as a catalyst for much discussion in the arena (Fielding, 2001, 2011; Groundwater-Smith & Mockler, 2016). Student voice has become a right and a "key aspect of youth agency" incorporating varied practices, but these require "careful, situated interpretation if we are to understand their meanings and effect" (Bragg & Manchester, 2012, p. 143). This raises some fundamental questions for both students and teaching staff. Students may feel alienated through what might be seen as a "tokenistic" approach to student voice (Fielding, 2011) or consider themselves being positioned merely as "consumers of education" (Whitty & Wisby, 2007, p. 303). Teaching staff may experience similar tensions in their understanding of the implications of student voice for teacher professionalism and whether it should be regarded as "an important element in establishing a 'collaborative' or 'democratic' professionalism, or a challenge to teachers' authority and cement an associated 'managerial' model of professionalism" (Whitty & Wisby, 2007, p. 303). These discourses are linked, even overlapping at times (Charteris & Smardon, 2019), and consequently, schools have pressed on with various student voice initiatives that might demonstrate collaboration and engagement for compliance purposes, but also, undoubtedly, with good intent to engage learners in constructive dialogue. Due to the constraints of complex regulatory frameworks which require evidence of both compliance and learning, however, schools are rarely able to step back; to challenge and to seek ways in which student voice can be not only a "tool for change", but also a "tool for reflection" (Bourke & Loveridge, 2016, p. 65).

So, the challenge from our perspective is how research can work more closely with teachers and empower them to incorporate student voice as "part of their own professional learning and development" (Bourke & Loveridge, 2016, p. 66). If school leaders and teachers can become more "invested" in the creation and development of knowledge, they can participate further in the drive to identify and formulate those knowledge needs (Kauffman et al., 2017). This does not necessarily mean a call for more methodologies that facilitate the involvement of schools, or for greater engagement with action research, but rather an avoidance of a "linear dissemination from experts to practitioners" (Blackmore, 2007, p. 28). Our discussion, therefore, moves on to consider ways in which we might be able to "reframe" research within this context.

# **3 Reframing the Role of Student Voice Research**

Hargreaves (1999, p. 125) described a "division" between researchers and practitioners, and Lester et al. (2002) raised the issue of how teachers and researchers might be expected to communicate when they obviously occupied different worlds. Researchers are frustrated with simplistic, mechanistic practices whilst teachers are subject to the supposed "moral and intellectual authority" of researchers who "derive their power" from criticising at a distance (Chapman & Ainscow, 2019, p. 915). Blackmore (2007) described the failings of a linear view of the research-practice relationship in which knowledge is supposed to be passed down from academics to practitioners, pointing especially to the inaccessibility of the reporting of findings. This implies that language is the most significant barrier; however, Biesta et al. (2019) raise fundamental questions about perceptions of the relevance of research. Chapman and Ainscow (2019) criticise the ways in which knowledge is produced by research, advocating for an inclusive "messy social learning process" (p. 914) which addresses unequal power relationships between researchers and practitioners.

We contend that these issues are especially noticeable in student voice work. At times, as Mager and Nowak (2012, p. 50) suggest, student voice researchers have conducted "too little methodologically strong research". Fielding (2011, p. 10) argues that student voice research has not paid enough attention to theoretical frameworks, due to the "corrosive nature of market-led approaches". There are questions about the value of student voice research and its capacity to influence practice. Likewise, the emancipatory and empowerment traditions of student voice research contribute to difficulties in the enactment and experience of student voice (Hall, 2020). Whilst teachers were willing to engage with research, Bourke and Loveridge (2016) report that it was challenging for them to take account of findings which appeared to contradict their experiences and views. Harris (2010, p. 88) concurs, noting that teachers had varying responses to findings:

There were some teachers in each group who really wanted to be handed immediate ideas that they could take back to their classrooms. Others felt they came with considerable expertise and there was nothing new they did not already know. Still others were pleased to engage in reflective discussion and make their own links to classroom practice whilst being open to new ideas.

There are surprisingly few mandates for teachers to connect with educational research, despite the professionalisation of education, with teachers often seen as receiving knowledge from external sources, rather than being part of creating it (Wollscheid et al., 2019). This argument is further supported by Harris (2010), who suggests that teachers are expected to receive and reproduce knowledge. Our intention, therefore, is to propose a reframing of research and the roles of researchers and practitioners, which involves a "reconstruction of relations" (Hargreaves, 1999, p. 136) in which teachers are "at the heart" (ibid.). To achieve these aims we need a "brokering" system (Wollscheid et al., 2019, p. 270) in which knowledge moves fluidly and dynamically between research and practice. For student voice research, these suggestions would support the construction of a dialogue which is more in keeping with the democratic, inclusive and transformational traditions of the field (Fielding, 2011). In this way, the concerns raised about how research might "reach" the "practice of education … [moving the focus and so] … changing the location of research and the identity of the researcher" (Biesta et al., 2019, p. 2) lead us to our next consideration: where, and how, such a shift might be enabled.

# **4 Critical Pragmatism as a Way Forward**

So, having established the tensions, constraints—and possibilities, how might we find a way forward? What has emerged from the discussion is the need to reach "beyond the confines of technical philosophy" (Dewey, 1949, p. xiv) towards a more critical approach. Such a perspective has the potential to help "bridge" the gap and facilitate discussion between research and practice and to have progressive adjustments made "in light of collective deliberation grounded in the experience of every member of society" (Curren, 2010, p. 494). Before considering its relevance and how it might be applied to student voice research and practice, it is necessary to first define critical pragmatism.

Critical pragmatism incorporates both pragmatism and critical theory. Dewey (1925) identifies Peirce (1839–1914), who was inspired by Kant's 1785 distinction between the practical and the pragmatic, as the originator of pragmatism. Dewey (1925) explains that Peirce was interested in how concepts could be made clear, which according to Peirce, could only be achieved by their application to human experience. Dewey (1925) elaborates, arguing that action is the intermediary through which concepts gain meaning. Furthermore, because actions can be different, meanings can be different. Biesta (2006, p. 30) interprets Dewey's thinking thus: "it is because people share in a common activity that their ideas and emotions are transformed as a result of the activity in which they participate". When applied to student voice, this understanding of pragmatism can help to explain variations in understandings and practices between schools, as well as divisions between the conceptualisation of student voice in theory and policy and how it is practised. Put simply, the concept of student voice is actioned in many ways, leading to multiple experiences and understandings. The critical aspect is crucial, encouraging reflective practice and drawing attention to power issues. As Feinberg (2015, p. 151) explains, "the distinctive task of critical pragmatism is to bring competing norms to the surface, to show how they impede experience and to encourage the formation of new ways."

At the start of this chapter, we began to explore some of the tensions that exist between student voice used as an accountability measure (Verhaeghe et al., 2010) and student voice as part of schools' democratic processes (Bourke & Loveridge, 2016; Defur & Korinek, 2010; Mitra, 2008). We witness teachers and schools endeavouring to meet external accountability requirements connected with education's "marketisation and the development of a consumer culture" (Murphy & Skillen, 2013, p. 84). Keddie (2015, p. 226) describes teachers having "a sense of powerlessness and high levels of uncertainty". Teachers express concerns about the erosion of their "ability to complete what they consider core professional tasks – dealing with the issues and concerns of pupils" (Murphy & Skillen, 2013, p. 89). In this climate, there is a danger that the potential of student voice as a reflective tool can be forgotten. For schools, critical pragmatism can be a useful lens, acknowledging the demands of accountability whilst also encouraging critical reflection. A critical pragmatist perspective suggests compromise rather than an either-or perspective; student voice need not be either for accountability or for democracy. Rather, by providing "fertile ground on which such ideas can be questioned, refined or even transformed" (Murphy & Skillen, 2013, p. 95), it enables schools to critically reflect on their practices of student voice: for example, the ways in which they collect the views of students and what they do with the data.

For student voice researchers, critical pragmatism encourages an acknowledgement of the realities of the complex network of demands on schools and the need for action, as well as a recognition of the importance of local knowledge and understanding in the practice of student voice. Taking a critical pragmatist stance militates against researchers becoming overly critical of student voice practices, instead highlighting the importance of examining contextualised practice. The potential now exists for researchers and practitioners to recognise and acknowledge that it is no longer enough for the role of research to be rooted in the production of "evidence-based practice" or "evidence-informed teaching": the "*what works*" as discussed by Biesta et al. (2019). Rather, researchers can seek to co-construct research knowledge that is "geared towards producing *useful* knowledge which is able to answer the questions practice ask … [whilst also acknowledging] … What does it work *for*?" (Biesta et al., 2019, p. 2).

It is, therefore, time to reframe our perceptions and perspectives so that rather than "determining practice" we grasp the potential for research to "inform practice", with teachers viewed not as "recipients of research and reproducers of knowledge", but rather as "producers and interrogators of research and builders of knowledge" (Harris, 2010, p. 83) in their professional capacities. A critical pragmatist orientation could thus have the potential to foster mediation, respecting the perspectives of all those involved, and—crucially—enabling each to learn "from, and about, each other, so that they can work to invent creative new options for action, [and] work to produce pragmatic outcomes serving their values and interests, as well" (Forester, 2012, p. 13). Critical pragmatism can enable a dialogue between researchers and schools, foster mutual recognition of each other's standpoints, and encourage learning from each other.

# **5 Conclusion**

Although the focus of our chapter is student voice, we highlight interconnections with student feedback, appreciating that in spite of their different foundations and agendas, the two concepts have much in common. We have considered the diversity of concepts and contexts (Mandouit, 2018; Verhaeghe et al., 2010), acknowledging that there are discourses that interlink and overlap (Charteris & Smardon, 2019). On the one hand, there is the capacity for collaborative dialogue and consultation across stakeholders (Nelson, 2015); on the other, the potential to be instrumentalist, tokenistic and compliance driven (Charteris & Smardon, 2019). Our aim in this discussion, therefore, is for student feedback and student voice research to be understood as "bounded in both the context and the culture of specific settings … [that make it] … complex, challenging and contradictory" (Fleming, 2015, p. 224). In doing so, we broaden the debate about the ways in which both student feedback and student voice are "espoused, enacted and experienced" (Hall, 2020, p. 125) by researchers and in schools.

Situated amidst complex regulatory frameworks, schools at times operate between conflicting objectives. It can, therefore, be difficult to see ways in which student feedback and student voice research can navigate competing priorities, institutional contexts, and pedagogical beliefs (Bourke & Loveridge, 2016; Bron et al., 2018; Holcar Brunauer, 2019). Researchers have an important role. Instead of positioning ourselves as remote experts, disseminating our findings and criticising practice from afar, we are suggesting the development of a "close-to-practice" approach (Wyse et al., 2020, p. 20). Researchers should seek collaboration with practitioners, thus encouraging an iterative process of research and application that includes "reflections on practice, research, and context" (ibid.). If there is to be change, then it needs to be through mediation of the knowledge (Wollscheid et al., 2019); and that knowledge has to have been co-constructed. A critically pragmatic perspective for both researchers and schools could facilitate the development of contextually rich practice(s)—recognising the constraints that schools operate within, whilst taking the strengths of pragmatic thought, valuing local knowledge and experiences (Keddie, 2015) and also contributing a critical lens.

To support these aspirations, we propose the following:


We suggest that critical pragmatism could provide a means through which to work towards these aims, enabling us to "rethink the complexities of deliberative processes" (Forester, 2012, p. 6); for researchers to start from where schools are and at the same time enable schools to critically examine their practice. Our premise, therefore, is that critical pragmatism could move us from a "deconstructive scepticism toward a reconstructive imagination" (Forester, 2012, p. 6) where there are possibilities for *joint* gain, and for *multi-directional* gain that may satisfy the multiple and diverse needs of all.

# **References**


**Mari-Ana Jones** After working as a teacher and school leader for over fifteen years in the UK and Norway, Mari-Ana Jones took up a research position (Ph.D.) at the Norwegian University of Science and Technology (NTNU). Her research interests include student voice and school improvement, with a particular focus on the role of school leadership. Mari-Ana leads the Master's in Educational Leadership at NTNU and is an Academic Supervisor on the National Program for School Leaders in Norway. She is also a consultant on school improvement and leadership in several school districts in Norway.

**Valerie Hall** has spent almost thirty years in Further and Higher Education in the UK, in senior leadership positions, teaching and supervising across Masters and Doctoral level programmes, and in the last decade taking a strategic lead in Initial Teacher Education (ITE) in the post-compulsory sector. Her practice and research are centred on student voice and explore individual perspectives, including self-efficacy, self-categorization, and social identities. Valerie currently holds the position of Honorary Research Fellow at the Education Observatory, University of Wolverhampton. She is also an External Academic Advisor to another ITE provider in the UK.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 14 What Can We Learn from Research on Multisource Feedback in Organizations?**

**John W. Fleenor**

**Abstract** This chapter provides a review of the current state of empirical research on the use of multisource feedback (MSF) in organizations (e.g., Church et al., 2019). The review covers key topics on the research and application of MSF for developing leaders in organizations. The focus of the chapter is on how research on MSF can be applied to the implementation of student feedback to teachers in schools. Based on this research, recommendations are offered for successfully implementing student feedback in schools. Topics include: (a) characteristics of effective MSF, (b) how to implement an MSF process in an organization, (c) factors that affect the reliability and validity of MSF, (d) a discussion of agreement between self-ratings and the ratings of others, (e) how to facilitate feedback to leaders, and (f) reasons why MSF processes may fail in organizations. Finally, the transferability of these findings to student-to-teacher feedback in schools is discussed.

**Keywords** Multisource feedback · Organizational psychology · Leadership development · Student-to-teacher feedback

# **1 Introduction**

In organizations, feedback can have a major impact on the quality of the employees' performance. Therefore, it is important that accurate and relevant feedback is provided to the organization's leaders. Because of the hierarchical structure of organizations, feedback is usually provided only by the leaders' own managers, which is, of course, a limited perspective of the leaders' effectiveness. A valuable tool for delivering such feedback is multisource feedback (MSF; also known as 360-degree feedback). In recent times, the growth of MSF has been a significant trend in the leadership development field. Since its inception in the late 1980s, MSF has gained increasing acceptance and importance in organizations (Silzer & Church, 2009).


In the organizational context, feedback is defined as information provided to employees related to their behavior on the job and the impact of that behavior on others (Fleenor & Taylor, 2019). Feedback is intended to strengthen desired behaviors and to recommend changes for undesired behaviors. Under the correct conditions, feedback can be a catalyst for change (Fleenor et al., 2020).

Most employees want to know how well they are doing their jobs. If they do not receive sufficient feedback, they often seek it on their own (Fleenor et al., 2020). Receiving useful feedback is an important motivational factor that can lead to increased job satisfaction (Bracken & Rose, 2011). Feedback can enhance self-awareness by identifying strengths and can facilitate growth by highlighting areas for improvement (Nowack, 2019).

The impact of multisource feedback can be significant if it is embedded in a larger leadership development process, that is, if it is fully integrated into the human resource management (HRM) system of the organization. Research has found that MSF can improve performance and lead to behavior change over time (Smither et al., 2005; Walker & Smither, 1999). The implementation of MSF has been shown to improve the financial performance of organizations through increased knowledge sharing and employee effectiveness (Kim et al., 2016).

For student-to-teacher feedback in schools, the most relevant counterpart to MSF in organizations is upward feedback. In upward feedback, ratings are solicited from the direct reports of the leader being assessed. This is a relatively common practice in organizations because direct reports are thought to be in the best position to judge a leader's effectiveness. The same could be said for the students of the teachers in schools. However, it would also be possible to conduct the "full circle" of feedback for teachers by including self-ratings and ratings from colleagues and headteachers or principals.

There are parallels between being a leader in an organization and a teacher in a school. For much of the discussion in this chapter, "teacher" could be substituted for "leader" and "student" could be substituted for "rater."

# **2 Multisource Feedback**

The purpose of multisource feedback is to provide accurate and useful feedback related to the effectiveness of leaders in their organizations (Fleenor & Brutus, 2001). This process includes collecting and reporting coworkers' ratings of a leader's effectiveness and providing feedback and coaching for each leader. Traditionally in organizations, feedback has come from a single source, the manager, which provides only a limited perspective of a leader's effectiveness. With MSF, the assessment of a leader's strengths and development needs is more reliable and valid. Because it uses multiple raters, MSF provides different perspectives of performance, making the feedback more accurate and useful to the leader. Additionally, the collection of feedback from several raters with different relationships to the leader will decrease the effects of the biases of the individual raters on the ratings.

There is little agreement in the literature on the terminology used in multisource feedback (Fleenor et al., 2020). In this chapter, the individual being assessed is referred to as the leader. Coworkers who provide the feedback are called raters, and usually include peers and direct reports. The leader's direct boss is referred to as the manager, who also provides feedback. The MSF survey that is completed by the raters is called the assessment. The scales on an MSF assessment represent leadership competencies that are important for success in the organization. MSF is sometimes used with employees who are not leaders; however, in that case, there are no direct-report raters.

# *2.1 The Multisource Feedback Process*

Most MSF processes have the following features (Fleenor & Taylor, 2019):


Typically, in an MSF process, the leader selects a number of coworkers to participate in the feedback process. Working individually, the raters and the leader complete surveys designed to collect information about the leader's specific skills, behaviors, and other attributes that are important for leader effectiveness. Leader effectiveness is defined as performance that makes leaders successful in their organizations (e.g., the leader's team successfully meets its goals for the year; Fleenor et al., 2020).

After raters complete the surveys, their ratings are electronically sent to a centralized location for scoring. A report is produced and delivered to a feedback coach, who then meets with the leader to review the report. The coach can be an internal human resource (HR) professional or the leader's manager who is trained to interpret the results of the assessment and assist the leader in understanding the report. The coach helps the leader use the feedback to create a plan to address developmental needs identified by the feedback.
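As a rough illustration of the scoring step, a minimal sketch is given below. The chapter does not prescribe any implementation; the rater groups, competency names, and ratings here are hypothetical. The idea is simply that individual responses are aggregated per rater group and competency before being compiled into the feedback report:

```python
# Illustrative sketch of MSF scoring: average item ratings per rater group
# (self, manager, peers, direct reports). All data here are made up.
from collections import defaultdict
from statistics import mean

# (rater_group, competency, rating) tuples as they might arrive for scoring.
responses = [
    ("peer", "communication", 4), ("peer", "communication", 3),
    ("direct_report", "communication", 2), ("direct_report", "communication", 3),
    ("manager", "communication", 4), ("self", "communication", 5),
]

# Group the raw ratings, then average them into a per-competency report.
by_group: dict[tuple[str, str], list[int]] = defaultdict(list)
for group, competency, rating in responses:
    by_group[(group, competency)].append(rating)

report: dict[str, dict[str, float]] = defaultdict(dict)
for (group, competency), ratings in by_group.items():
    report[competency][group] = mean(ratings)

print(report["communication"])
# e.g. {'peer': 3.5, 'direct_report': 2.5, 'manager': 4, 'self': 5}
```

Keeping group averages side by side in this way is what later enables the self-other comparisons discussed in Sect. 3.2.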

Multisource feedback provides a structured means of collecting and processing data, and an opportunity to reflect on this valuable information. It may be the only opportunity some leaders have to consciously self-reflect on their effectiveness. MSF systems also guarantee the anonymity of the raters. There is evidence that anonymous feedback is more honest than open feedback (Kozlowski et al., 1998). This appears to be particularly true when direct reports are rating their leaders. A climate of trust must be created for the MSF process—when anonymity is ensured, the feedback will be more accurate. If raters believe that anonymity was violated, then less honesty can be expected in future MSF administrations, with a corresponding loss of reliability and validity (London & Wohlers, 1991). Anonymity differs from confidentiality. Confidentiality requires that access to MSF data be limited to individuals who are permitted to see the data in accordance with organizational policy. Confidentiality is important to assure participants that their data are protected and will not be seen by unauthorized individuals in the organization. A lack of confidentiality may result in lower participation rates in future MSF administrations.

# *2.2 Using Multisource Feedback for Leader Development*

Because of its structure, thoroughness, and anonymity, MSF is likely to be accepted and acted on by the leaders receiving the feedback (Atwater et al., 2007). To ensure the effectiveness of MSF, it should be implemented within a broader leadership development context. For example, MSF should be integrated into the organization's leader development and succession planning systems to help identify how leaders can become more effective in their organizations. The organization's leadership development system is responsible for providing activities, such as MSF, that will increase the effectiveness of its leaders. The succession planning system is responsible for creating a pipeline of leadership talent for the future. The integration of the leader development and the succession planning systems should create conditions that allow leaders to receive ongoing feedback along with new job assignments, thus increasing their current competencies (McCauley & Brutus, 2019).

Many organizations use MSF as an integral part of development processes for individual leaders. Even when leaders have good insights about their own strengths and development needs, they may not be fully aware of how their behaviors affect their coworkers (Fleenor et al., 2010). After they receive the results of their MSF assessment, leaders have a clearer idea of how their behaviors consistently affect others.

In addition to its use in developing individual leaders, some organizations use aggregated MSF data to determine group strengths and weaknesses for needs analysis purposes. Furthermore, the process of responding to the assessment underscores desired behaviors and creates discussion of which behaviors are valued throughout the organization. This occurs because the items on the MSF assessment indicate what leadership behaviors are considered important by the organization (Bracken & Rotolo, 2019).

# *2.3 Characteristics of Multisource Feedback*

The characteristics of MSF can be thought of as the interactive product of both the assessment and the raters (Bracken & Rose, 2011). According to Bracken and Rotolo (2019), the most important characteristics of MSF are: (a) awareness of the feedback (including reactions and receptivity); (b) acceptance of the feedback; and (c) accountability for acting on the feedback. These characteristics are important for ensuring MSF will result in desired behavior change in the focal leaders (Bracken et al., 2001). Each of these characteristics is discussed below:

#### **2.3.1 Awareness of the Feedback**

Awareness involves bringing the information to the attention of the leaders: they must be aware of the feedback before they can act on it, recognize their weaknesses, and take action to correct them. Awareness of the feedback includes reactions and receptivity to the feedback by the recipients. Reactions can range from being pleased with the feedback to feelings of hurt and resentment. A leader's health and psychological well-being may be negatively affected by receiving unfavorable feedback (Nowack, 2019). Feedback coaches play an important role in helping leaders work through any emotional reactions (Fleenor et al., 2020).

Receptivity relates to a leader's psychological readiness to receive the feedback. It is positively related to both emotional intelligence and perceptions of the feedback environment (Dahling et al., 2012). Additionally, research indicates that feedback orientation, which is the degree to which a leader is ready to receive the feedback, can predict the leader's emotional reactions to their feedback (Braddy et al., 2013).

#### **2.3.2 Acceptance of the Feedback**

Acceptance is the leaders' belief that the feedback is an accurate description of their behavior (Ilgen et al., 1979). A key event occurs when the leader decides to accept the feedback as valid and useful information. For the feedback to be accepted, a leader must be aware of and receptive to it. When the feedback is not accepted, no behavior change will result (Bracken & Rose, 2011). First-time MSF participants may experience shock, anger, and rejection of the feedback before finally accepting it (Brett & Atwater, 2001). To ensure acceptance, resources for assisting leaders in dealing with their feedback should be provided by the organization (e.g., coaches, workshops, developmental activities, etc.; Fleenor et al., 2020).

#### **2.3.3 Accountability for Acting on the Feedback**

Accountability for acting on the feedback is necessary for a sustainable MSF process. This requires organizations to ensure that leaders take improvement-oriented actions based on their feedback. Methods for ensuring accountability include the full support of the leader's manager for the MSF process and providing access to developmental resources such as new job assignments and training (London, 2003). Accountability is the major component for moving from acceptance to improved leader effectiveness (Bracken & Rotolo, 2019).

A successful MSF process requires full accountability, not only from the leaders, but also from other groups involved, namely, raters, managers, and the organization (London et al., 1997). If raters believe leaders are not being held accountable for acting on their feedback, they will be less likely to provide effective feedback in future MSF administrations. On the other hand, when raters see their feedback is being used productively, they can be expected to continue to provide accurate, honest feedback (Bracken & Rotolo, 2019).

# **3 Reliability and Validity of Multisource Feedback**

There are a number of factors that affect the validity of an MSF implementation (Bracken et al., 2001). These factors are directly related to the characteristics of a successful MSF process (Bracken & Rotolo, 2019). For MSF, the conceptualization of validity (e.g., content, construct, and criterion-related validity) is more complex than traditional notions of validity that arose from controlled, standardized settings such as intelligence testing. In those settings, validity was determined by a single measurement event in which an individual responds to the items on an assessment (i.e., single-source data). MSF depends on the collection of data from potentially unreliable sources (i.e., multiple raters). It is a complex process with the characteristics of both psychometric testing and large-scale data collection (Fleenor, 2019).

# *3.1 Validity Factors in Multisource Feedback*

The primary factors that affect the validity of MSF are described below and summarized in Table 1 with design recommendations from Bracken et al. (2001):


#### **3.1.1 Alignment**

This is the traditional definition of content validity—the extent to which the feedback (e.g., competencies, behaviors) is important for success in the organization (Bracken & Rotolo, 2019). If the competencies being measured are not related to success, then the content validity of the process is deficient. Alignment occurs when the values and goals of the organization are translated into a set of competencies for the entire organization (Campion et al., 2019).

#### **3.1.2 Accuracy**

This includes the process of accurately collecting and processing data, and reporting the feedback. Errors in the feedback reports can negatively affect leaders' confidence in the process. Considerations for increasing accuracy include scoring systems with the capacity to handle high volumes of data with secure reporting, quality control to eliminate errors, and pre-populating of demographic data.

#### **3.1.3 Clarity**

Raters must be given instructions on how to correctly complete the assessment and return it on time. Errors typically made by raters include miscoding the person they are rating, misusing the response scale, and providing inappropriate write-in comments. To increase clarity, orientation sessions should be held with the raters to increase their understanding of the process.

#### **3.1.4 Cooperation**

The quality of MSF depends on the willingness of the raters to fully participate and provide reliable responses. Design features that affect this factor relate to the magnitude of the task, such as instrument length and the number of surveys a rater must complete. Indicators of low cooperation include unreturned or incomplete surveys and the effects of rater fatigue on the feedback. A simple metric for evaluating cooperation is the overall organization-wide response rate; if less than 75% of the surveys are completed, this should be cause for concern (Bracken et al., 2001).
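As a purely illustrative sketch of this response-rate check (the 75% threshold is from Bracken et al., 2001; the survey counts below are invented):

```python
# Illustrative only: flag an organization-wide MSF response rate against
# the 75% guideline from Bracken et al. (2001). Counts are hypothetical.
def response_rate(completed: int, distributed: int) -> float:
    """Fraction of distributed surveys that were completed."""
    if distributed <= 0:
        raise ValueError("No surveys were distributed.")
    return completed / distributed

rate = response_rate(completed=620, distributed=900)
if rate < 0.75:
    print(f"Response rate {rate:.0%} is below 75%: possible low cooperation.")
else:
    print(f"Response rate {rate:.0%} meets the 75% guideline.")
```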

#### **3.1.5 Timeliness**

Timeliness in providing feedback is an important factor to ensure acceptance of the feedback by the participant. Delays in providing results coupled with recency effects in the ratings may result in feedback that is no longer valid. This can have implications for how effective the feedback is in addressing the needs of the participant and the organization (Bracken et al., 2001).

#### **3.1.6 Reliability**

In this context, reliability refers to how dependably or consistently MSF measures the competencies on the assessment. This factor includes the importance of reliability in MSF, how it should be measured, and what level of reliability is acceptable (see Pulakos & Rose, 2019).

Some commonly used reliability indices may not be appropriate for MSF ratings. For example, test–retest reliabilities may be affected by changes in the raters themselves (e.g., attitudes and opportunity to observe). The raters at Time 1 are often different from the raters at Time 2—with less than a 75% overlap in raters, the results can be misleading. Therefore, it is not recommended that test–retest reliability be used with MSF (Bracken et al., 2001). Internal consistency reliability (e.g., coefficient alpha) provides evidence that the items on a scale (i.e., a dimension or competency) are internally reliable. Poorly written items often negatively affect the internal consistency reliability of MSF ratings. The use of double- and triple-barreled items in an attempt to shorten surveys can also reduce the reliabilities. Overall, low reliabilities can obscure the meaningful interpretation of the feedback (Fleenor, 2019). Other factors that affect the internal consistency of MSF ratings include the misinterpretation of the rating scale by the raters. Typically, 5- to 7-point Likert scales with clearly defined anchors are recommended (Bracken & Rotolo, 2019).
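To make the internal consistency computation concrete, here is a minimal sketch of coefficient alpha using its standard formula; the raters-by-items matrix is invented for illustration:

```python
# Minimal sketch (not from the chapter): coefficient (Cronbach's) alpha
# for one MSF competency scale, from a raters x items matrix of ratings.
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: 2-D array, rows = raters, columns = items on one scale."""
    k = ratings.shape[1]                          # number of items
    item_vars = ratings.var(axis=0, ddof=1)       # variance of each item
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of scale totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five raters answering a four-item scale on a 1-5 Likert scale (made-up data).
ratings = np.array([[4, 4, 5, 4],
                    [3, 3, 4, 3],
                    [5, 4, 5, 5],
                    [2, 3, 3, 2],
                    [4, 5, 4, 4]])
print(f"alpha = {cronbach_alpha(ratings):.2f}")  # -> alpha = 0.93
```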

Interrater reliability is used to determine the agreement within rater groups. Moderate levels of interrater reliability within these groups have been reported (e.g., Brett & Atwater, 2001). However, direct report ratings are often found to have the lowest reliabilities (Braddy et al., 2014). To increase the reliabilities within rater groups, all eligible raters should be used, particularly all direct reports. In general, more raters will result in more reliable ratings (Fleenor, 2019).
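The effect of adding raters can be illustrated with the standard Spearman–Brown formula from classical test theory (the chapter itself does not give a formula); the single-rater reliability of 0.30 used below is a hypothetical value:

```python
# Illustrative sketch: Spearman-Brown prophecy formula for the reliability
# of the mean of n parallel ratings, showing why more raters help.
def spearman_brown(single_rater_r: float, n_raters: int) -> float:
    """Reliability of the average of n_raters parallel ratings."""
    return (n_raters * single_rater_r) / (1 + (n_raters - 1) * single_rater_r)

for n in (1, 3, 5, 10):
    print(f"{n:2d} raters -> reliability {spearman_brown(0.30, n):.2f}")
# 1 rater  -> 0.30; 3 raters -> 0.56; 5 raters -> 0.68; 10 raters -> 0.81
```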

Typically, the correlation between the ratings from the various rater groups has been found to be low (Tornow, 1993). However, the reason for conducting MSF is to bring different perspectives of a leader's performance to the process. While the rater groups may disagree, each group may have a valid perspective of a leader's performance because leaders often interact differently with the various groups. For example, a leader may be interpersonally warm with peers, but cold and distant with direct reports.

#### **3.1.7 Insight**

Leaders should be provided with the information needed to take actions that are aligned with their feedback. The format of the assessment and feedback report should be designed to maximize participants' understanding of their results.

Feedback should be provided at the item level—not just at the competency (i.e., scale) level. With item-level feedback, leaders have a basis for determining the specific behaviors (i.e., the items) that resulted in their ratings. Processes that collect written comments cannot be expected to replace item-level feedback and may even increase the burden on the raters by requiring them to provide detailed descriptions of a leader's behavior. This may result in "rater fatigue," especially when raters are required to complete MSF assessments for several employees (Rose et al., 2004).

# *3.2 Self-other Rating Agreement in Multisource Feedback*

Often with MSF, self-ratings are found to differ significantly from the ratings of others (Fleenor et al., 2010). For example, individuals with high self-esteem may over-rate themselves relative to others' ratings of them. For this reason, the use of self-ratings alone is not recommended. However, the level of agreement between self- and others' ratings can provide important and useful information (Furnham, 2019). There appears to be a relationship between self-other agreement and leader effectiveness. In general, leaders who rate themselves similarly to how others rate them (in-agreement raters) appear to be more effective than leaders who rate themselves differently (Fleenor et al., 2010). However, the relationship between self-other rating agreement and leader effectiveness is non-linear. For example, leaders who under-rate themselves appear to be more effective than those who over-rate themselves (Braddy et al., 2014). Similarly, teachers who under-rate themselves are likely to be more effective than teachers who over-rate themselves.

For MSF, the challenge is to develop a relatively simple index of self-other rating agreement that participants can easily understand in their feedback reports. For example, such an index would categorize a leader as an under-rater, in-agreement rater, or over-rater.
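A minimal sketch of such an index is shown below; the half-standard-deviation band used to define "in agreement" is an illustrative assumption, not a value taken from the literature reviewed here, and all ratings are made up:

```python
# Illustrative sketch of a simple self-other agreement index that
# classifies a leader as under-rater, in-agreement, or over-rater.
from statistics import mean, stdev

def agreement_category(self_rating: float, other_ratings: list[float],
                       band: float = 0.5) -> str:
    """Compare a self-rating to the mean of others' ratings.

    band: width of the "in-agreement" zone, in standard deviations of
    the others' ratings (0.5 is an arbitrary illustrative choice).
    """
    others_mean = mean(other_ratings)
    others_sd = stdev(other_ratings)
    diff = self_rating - others_mean
    if diff > band * others_sd:
        return "over-rater"
    if diff < -band * others_sd:
        return "under-rater"
    return "in-agreement"

# A leader rating themselves 4.6 while raters average around 3.5.
print(agreement_category(4.6, [3.2, 3.6, 3.8, 3.4, 3.5]))  # -> over-rater
```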

# **4 Recommendations for Facilitating Multisource Feedback**

Best practices suggest that a confidential one-on-one feedback session be conducted between the leader and a coach. The coach provides an introduction to the MSF assessment, an analysis of the individual's feedback, and assists with developmental planning. These sessions are particularly important for leaders receiving feedback for the first time. They usually appreciate discussing their feedback with an experienced coach. The coach helps the participant understand that conflicting ratings may be valid, and comparisons between the different rating sources are important (Fleenor et al., 2020).

Leaders must be given adequate time to process their feedback before the one-on-one session. Unfortunately, some organizations distribute the reports and allow the leaders only a few minutes to look over their results prior to their feedback session. Without time to reflect on their report and process any immediate emotional reactions to the data, leaders may not be ready to fully accept the feedback.

The coach should prepare for the session in advance by thoroughly reviewing the feedback report. The session should be held in a private room and leaders should be given the opportunity to audio record their session, which will serve as a useful resource for participants to review progress on their development plans.

# **5 Why Multisource Feedback Processes Fail**

While best practices are fully documented in the literature (Fleenor et al., 2020), practitioners continue to struggle with implementation issues with MSF. These issues affect the quality of the feedback (e.g., validity, reliability, accuracy) and therefore the future success of MSF processes. Many of these issues can be avoided by careful design, planning, and follow-up. A number of problems are common to failed MSF implementations (Fleenor et al., 2020):


# **6 The Transferability of Multisource Feedback Research to Student-to-Teacher Feedback in Schools**

There seems to be considerable overlap between leading in organizations and teaching in schools. Schools themselves are, of course, organizations with particular hierarchies and cultures relevant to the educational context. Teachers could be considered as the "leaders" of the students, and the students as the "direct reports" of the teachers. As discussed previously, the most relevant organizational counterpart to student-to-teacher feedback is upward feedback. In upward feedback, ratings are solicited from the direct reports of the leader being assessed because they are thought to be in the best position to judge the leader's effectiveness. In the same vein, students are probably in the best position to judge a teacher's effectiveness. The upward feedback model could be expanded to include self-ratings and the ratings of others (peers and leaders such as lead teachers or principals). This would result in full "360-degree" feedback, which may provide more valid and reliable feedback for teachers.

For teachers, MSF would provide access to structured feedback from students on their teaching quality, a source of feedback they rarely receive. Given that students have a unique perspective of teacher effectiveness, such feedback could be helpful for teachers who want to improve their teaching effectiveness. Administering MSF in the classroom has an advantage over the typical organization because teachers are likely to have mostly the same students during the school year. In organizations, a leader's direct reports may change frequently because of reorganizations, reassignments, and turnover. Because of the relatively stable rater population, teachers can be more easily evaluated over time to determine how much they have improved.

If administered as a full 360-degree feedback model, MSF would provide teachers with multiple perspectives of their performance. They would be able to compare their ratings from various sources (students, peers, principals) to determine if they are perceived differently by these groups. Teachers would be able to compare their self-ratings to the ratings of others to see if they have an unrealistic view of their own performance.

Following the feedback facilitation model typically used in organizations, lead teachers or principals could act as coaches for the teachers and assist them with digesting their feedback and developing plans for acting on the feedback. Teachers would create development plans for improving in areas of weakness and leveraging their strengths. Additionally, the use of a validated teacher competency model would inform them of what capabilities are needed to be an effective teacher. In schools, MSF should be administered on a regular basis so the teachers' performance can be evaluated over time.

There is some evidence, however, that providing MSF alone may not result in sustained behavior change for teachers. For example, Bijlsma et al. (2019) found that teachers did not improve their teaching quality in response to student feedback received via smartphones. As a result of the student feedback, however, the teachers did gain more insight into how they could improve and reported improvement-oriented efforts in response to the feedback. For student feedback to be more effective in creating sustained behavior change in teachers, Bijlsma et al. recommend that:


The above recommendations align closely with the characteristics of effective MSF in organizations. The basic tenets of MSF in organizations, therefore, should also be applied to student-to-teacher feedback:


In summary, if the recommendations discussed in this chapter are closely followed, then student–teacher feedback is more likely to be successful in schools.

# **References**


**John W. Fleenor** is a Senior Fellow at the Center for Creative Leadership in Greensboro, NC (USA). Additionally, he is an Adjunct Associate Professor of Psychology at North Carolina State University. For over 20 years he has conducted research on measures of leadership, focusing on multi-rater (360-degree) feedback. He has published extensively on this topic, including six books. He is a Fellow of the Society for Industrial and Organizational Psychology (SIOP).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 15 Lessons Learned from Research on Student Evaluation of Teaching in Higher Education**

## **Bob Uttl**

**Abstract** In higher education, anonymous student evaluation of teaching (SET) ratings are used to measure faculty's teaching effectiveness and to make high-stakes decisions about hiring, firing, promotion, merit pay, and teaching awards. SET have many desirable properties: SET are quick and cheap to collect, SET means and standard deviations give an aura of precision and scientific validity, and SET provide tangible, seemingly objective numbers for both high-stakes decisions and public accountability purposes. Unfortunately, SET as a measure of teaching effectiveness are fatally flawed. First, experts cannot agree on what effective teaching is. They only agree that effective teaching ought to result in learning. Second, SET do not measure faculty's teaching effectiveness, as students do not learn more from more highly rated professors. Third, SET depend on many teaching effectiveness irrelevant factors (TEIFs) not attributable to the professor (e.g., students' intelligence, students' prior knowledge, class size, subject). Fourth, SET are influenced by student preference factors (SPFs) whose consideration violates human rights legislation (e.g., ethnicity, accent). Fifth, SET are easily manipulated by chocolates, course easiness, and other incentives. However, student ratings of professors can be used for very limited purposes such as formative feedback and raising alarm about ineffective teaching practices.

**Keywords** Student evaluation of teaching · SET · Validity · Teaching effectiveness

B. Uttl (B)

Mount Royal University, Calgary, AB, Canada e-mail: buttl@mtroyal.ca

# **1 Introduction**

In higher education, anonymous student evaluation of teaching (SET) ratings are used to measure the teaching effectiveness of faculty members and to make high-stakes decisions about them, such as hiring, firing, promotion, tenure, merit pay, and teaching awards (Uttl et al., 2017). If available, SET ratings are also used by students for course selection, in the same manner as the ratings on the popular website www.ratemyprofessor.com (RMP). SET have their allure: (a) SET are quick and cheap to administer; (b) SET means and standard deviations give an aura of precision and scientific validity; and (c) SET provide tangible, seemingly objective numbers for high-stakes decisions and public accountability purposes. However, a still little-known legal case from Ryerson University in Toronto (Ryerson University v. Ryerson Faculty Association, 2018 CanLII 58446, available at www.canlii.org) is a wake-up call about the uninformed use of SET, and a reminder that SET are not valid as a measure of faculty's teaching effectiveness. In this chapter, I review the evidence against SET: evidence showing that they do not measure teaching effectiveness, vary predictably with factors completely irrelevant to faculty's teaching effectiveness, and can be raised with something as small as a Hershey's Kiss. I will also argue that the widespread use of SET may be one of the main contributors to grade inflation, driving up grades over the past 30 years, during a period when time spent studying has been steadily decreasing and the proportion of high school students entering colleges and universities has been increasing.

Typically, within the last few weeks of classes, students are asked to rate their professors on various scales. A university evaluation unit then summarizes the ratings for each class and, after the classes are over and the final grades have been assigned, provides various statistical summaries, including means and standard deviations, to faculty and their administrators. These summaries may include departmental, faculty, or university "norms," such as the means and standard deviations of all course means within the department, faculty, and/or university. These summaries are then used as the key, if not sole, evidence of faculty teaching effectiveness (Uttl et al., 2017).

At the same time, no standards for satisfactory SET ratings are provided to anyone. Evaluators (chairs, deans, tenure and promotion committees, provosts, and presidents) use their own individual standards to arrive at their decisions about faculty teaching effectiveness. It is not uncommon for these evaluators to believe that faculty members falling below the mean are unsatisfactory and in need of improving their teaching. Moreover, these evaluators change periodically and unpredictably, even within the typical six-year time frame between a faculty member's initial hiring and the eventual decision about promotion and/or tenure.

There are three types of commonly used SET tools: those developed in-house by an institution, those available for free, such as the SEEQ (Marsh, 1980, 1991), and those developed commercially for purchase, such as the ETS SIR-II sold by the Educational Testing Service (www.ets.org), the IDEA SRI sold by the IDEA Center (www.ideaedu.org), and the CIEQ sold by C.O.D.E.S. Inc. (www.cieq.com). In all of these systems, faculty's SET ratings are often compared to departmental, faculty, university, or "national norms" (i.e., the average SET ratings for all institutions that purchased a particular commercial SET system). The commercial systems also give some guidelines on the interpretation of SET. For example, the C.O.D.E.S. Inc. guidelines specify that faculty scoring below the 70th percentile need at least "some improvement," implying that only the top 30% of faculty with the highest SET ratings are good enough and need "no improvement" (see www.cieq.com/faq). Notably, all of the commercial systems are explicitly intended to be used both for faculty development (formative uses) and for high-stakes personnel decisions (summative uses), and their developers believe that they are valid measures of teaching effectiveness.

The focus on norm-referenced interpretation of SET ratings, requiring faculty to place above the 30th, 50th, or even 70th percentile to avoid criticism of their teaching, will always, by definition, leave a large proportion of faculty members labeled unsatisfactory or in "need of at least some improvement." Assuming few faculty members want such labels, this type of norm-referenced interpretation of SET sets up and fuels a race among faculty members to reach ratings as high as possible. By definition, depending on the specific percentile cut-offs, 30, 50, or 70% of the faculty will lose this race. The higher the percentile cut-off, the more intense and more high-stakes the race becomes.

Regardless of the specific percentile cut-offs for "unsatisfactory" or in "need of some improvement" labels, some proponents of SET ratings also argue that SET identify faculty members who successfully match their academic standards, teaching demands, and workload to students' abilities. For example, in response to arguments that SET are responsible for grade inflation and work deflation, Abrami and d'Apollonia (1999) argued:

academic standards that are too high may be as detrimental to the learning of students as academic standards that are too low. The arts and science of good teaching is finding the balance between what students might learn and what students are capable of learning. We believe that ratings help identify those instructors who do this well. (p. 520)

As Uttl et al. (2017) observed, in Abrami and d'Apollonia's (1999) view, SET are an appropriate standards meter, allowing professors to determine what students perceive to be an appropriate workload and an appropriate amount to learn for specific grades; in short, SET are a proxy of appropriate academic standards from the students' perspective. Professors who get high SET ratings are appropriately matching their standards to students' standards, and professors who get low SET ratings are failing to do so.

# **2 SET Are an Invalid Measure of Faculty Teaching Effectiveness**

Are SET a valid measure of faculty teaching effectiveness? Do students learn more from more highly rated professors? If SET are a valid measure of faculty's teaching effectiveness, SET ought to correlate strongly with student achievement attributable to the professors' teaching, and ought not to be influenced by teaching effectiveness irrelevant factors (TEIFs) such as students' intelligence, cognitive ability, prior knowledge, motivation, interest, subject field, class size, class meeting time, etc. SET also ought not to be influenced by student preference factors (SPFs) such as professors' hotness/attractiveness, age, gender, accent, nationality, ethnicity, race, disability, etc., whose consideration runs afoul of human rights legislation. Finally, SET ought not to be influenced by factors that are ill-advised or detrimental to student learning (DSLFs), such as professors reducing workloads, inflating grades, and distributing chocolates and cookies. Review of the literature, however, now convincingly shows that SET are not a valid measure of teaching effectiveness, that students do not learn more from more highly rated professors, and that SET are substantially influenced by numerous TEIFs, SPFs, and DSLFs.

# *2.1 There Is No Widely Accepted Definition of Effective Teaching*

The first fundamental problem in assessing the validity of SET as a measure of faculty teaching effectiveness is that professors, administrators, and even experts do not agree on what effective teaching is (Uttl et al., 2017). Experts do not even agree on which teaching methods are effective and which specific teaching behaviors amount to effective teaching. For example, some professors, administrators, and experts believe that teaching methods such as unannounced pop quizzes, questioning students in front of their peers, and encouraging student attendance by leaving out words or phrases from lecture slides are effective. In contrast, others believe that these same methods are insensitive, anxiety-producing, and even demeaning, disrespectful, and detrimental to student learning.

In the absence of an agreed-upon definition, it is impossible to measure effective teaching. However, experts do agree that effective teaching ought to result in student learning (Uttl et al., 2017). Accordingly, studies attempting to establish the validity of SET as a measure of effective teaching have focused on determining the correlation between professors' mean class SET ratings and student achievement.

# *2.2 Students Do Not Learn More from More Highly Rated Professors*

For nearly 40 years, the key evidence cited to support the validity of SET as a measure of faculty teaching effectiveness has been multisection studies that examine the correlations between mean class SET ratings and mean class student achievement on common exams. An ideal multisection study has several critical features: (a) it examines the correlation between SET and student achievement in a large course split into numerous smaller sections, with each section taught by a different professor; (b) professors follow the same course outline and use the same assessments, including the same final exam; (c) students are randomly assigned to the sections; and (d) SET are administered to all sections at the same time, prior to the final exam. In this design, if students learn more from more highly rated professors, the sections' average SET ratings ought to be highly correlated with the sections' average final exam scores. Experts have generally agreed that multisection studies are the strongest evidence for determining the validity of SET as a measure of professors' teaching effectiveness, that is, of professors' contribution to students' learning (Uttl et al., 2017).
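In symbols, the multisection design reduces the validity question to a single correlation computed over sections (generic notation for illustration; individual studies differ in detail):

```latex
r_{\text{SET/learning}} = \operatorname{corr}\bigl(\bar{S}_{j},\ \bar{A}_{j}\bigr), \qquad j = 1, \dots, k
```

where $\bar{S}_{j}$ is the mean SET rating and $\bar{A}_{j}$ the mean common final exam score of section $j$, and $k$ is the number of sections. If students learned more from more highly rated professors, this correlation would be large and positive.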

Cohen (1981) published the first meta-analysis of 67 multisection studies available to that date and reported a small-to-moderate SET/learning correlation *r* = 0.43. Cohen concluded: "The results of the meta-analysis provide strong support for the validity of student ratings as a measure of teaching effectiveness" (p. 281) and continued: "we can safely say that student ratings of instructions are a valid index of instructional effectiveness. Students do a pretty good job of distinguishing among teachers on the basis of how much they have learned" (p. 305). Cohen's findings and conclusions have subsequently been cited over 1,000 times as evidence of SET validity as a measure of faculty teaching effectiveness (Web of Science, Google Scholar).

However, Uttl et al. (2017) recently demonstrated that Cohen's (1981) conclusions were unwarranted and the result of flawed methods and data analyses. Most critically, Cohen disregarded the sample sizes of the primary studies in his meta-analysis. In doing so, he gave the many small studies the same weight as the fewer large studies. Compounding this problem, Cohen also failed to take into account the small sample size bias clearly visible in scatterplots of SET/learning correlations as a function of sample size. After taking this bias into account, the best estimate of the SET/learning correlation was only *r* = 0.27, substantially less than the *r* = 0.43 reported by Cohen. Uttl et al. (2017) also conducted a new, updated meta-analysis of 97 multisection studies. Figure 1, Panel A, shows the results; they confirm the strong small sample size bias already visible in Cohen's (1981) data set. Taking the small sample size bias into account, the best estimate of the SET/learning correlation from this new meta-analysis is *r* = 0.08. Panel B shows the Uttl et al. results for only those studies that adjusted the SET/learning correlations for prior learning/ability. The best estimate taking into account both the small sample size bias and prior learning/ability is nearly zero, *r* = −0.02. Accordingly, the multisection studies demonstrate that SET/learning correlations are essentially zero. In other words, students do not learn more from more highly rated professors.
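The weighting problem can be made concrete. One standard fixed-effect approach (a sketch of inverse-variance weighting, not necessarily the exact estimator Uttl et al. used) converts each study's correlation to Fisher's z and weights it by $n_i - 3$, the inverse of its sampling variance:

```latex
z_{i} = \operatorname{artanh}(r_{i}), \qquad w_{i} = n_{i} - 3, \qquad
\bar{z} = \frac{\sum_{i} w_{i} z_{i}}{\sum_{i} w_{i}}, \qquad
\bar{r} = \tanh(\bar{z})
```

An unweighted mean of the $r_{i}$ lets many small, noisy studies count as much as the few large ones; weighting by sample size pulls the estimate toward the larger studies, which in this literature report much smaller SET/learning correlations.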

# *2.3 SET Are Influenced by Many Teaching Effectiveness Irrelevant Factors*

SET correlate with numerous TEIFs such as students' intelligence, cognitive ability, interest, and motivation; subject field; class size; etc.


**Fig. 1** The results of meta-analyses of multisection studies. Panel **A** shows the scatterplot of SET/learning correlations by study size for Uttl et al.'s (2017) new, updated meta-analysis. After taking into account the small sample size bias, the SET/learning correlation was only *r* = 0.08 for SET averages. Panel **B** shows the Uttl et al. (2017) results for only those studies that adjusted the SET/learning correlations for prior learning/ability. After taking into account both the small sample size bias and prior learning/ability, the SET/learning correlation is nearly zero, *r* = −0.02

#### **2.3.1 Students' Intelligence, Ability, and the Kruger–Dunning Effect**

Numerous studies have demonstrated that people are generally very poor at assessing their own cognitive abilities, including attention, learning, and memory. Correlations between self-assessments of abilities and performance on objective tests of those abilities are generally close to zero (Uttl & Kibreab, 2011; Williams et al., 2017). Yet many SET forms ask students to rate how much they learned from their professors.

Furthermore, as Kruger and Dunning (1999) demonstrated, people's self-assessment of their abilities depends on the abilities themselves. Those scoring low on objective ability tests hugely overestimated their performance, whereas those scoring high tended to underestimate theirs. Moreover, low-ability individuals were less able to distinguish superior from inferior performance in their peers. As Kruger and Dunning observed, the incompetent are not only incompetent; their incompetence also deprives them of the ability to recognize their own incompetence as well as the competence of others. It is self-evident that students who believe their work deserved As or Bs but received Ds or Fs are unlikely to be satisfied and unlikely to give their professors high SET ratings.

#### **2.3.2 Student Interest and Motivation**

Hoyt and Lee (2002) reported SET ratings by student motivation and class size for the 20 items of the IDEA SRI. Student motivation was measured by the question "I really wanted to take this course regardless of who taught it." Collapsed across questions and class size, the least motivated students gave SET ratings that were 0.44 points lower than those of the most motivated students, corresponding to an approximately 0.75 standard deviation difference. Moreover, this effect was substantial on each and every question, ranging from a 0.24 to 0.70 point difference on the 1–5 rating scale.
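Taken together, the two figures imply a pooled standard deviation of roughly 0.6 rating points (a back-of-the-envelope check, assuming both numbers refer to the same pooled SD):

```latex
\mathrm{SD} \approx \frac{0.44\ \text{points}}{0.75} \approx 0.59\ \text{points on the 1--5 scale}
```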

#### **2.3.3 Course Subject**

Centra (2009) reported that natural science, mathematics, engineering, and computer science courses were rated substantially lower, by about 0.30 standard deviations, than courses in the humanities such as English, history, and languages. Similarly, Beran and Violato (2009) reported that courses in the natural sciences were rated 0.61 standard deviations lower than courses in the social sciences. Surprisingly, Centra as well as Beran and Violato concluded that these effects could be ignored.

Using data from 14,872 course evaluations at a mid-sized US university, Uttl and Smibert (2017) demonstrated that the differences in SET ratings between subjects such as English and Math are substantial (the difference between the means was 0.61 on a 5-point scale), and that professors teaching quantitative courses are far more likely to be labeled unsatisfactory when evaluated against common criteria for a satisfactory label. Figure 2 shows the distributions of SET ratings for Math, English, Psychology, History, and all courses. The distribution of math professors' ratings is closer to normal and substantially shifted toward less-than-excellent ratings, whereas the distributions of ratings for English, history, psychology, and all courses are higher and skewed toward the top of the scale. Thus, if the same standards are applied to professors teaching quantitative vs. non-quantitative courses, professors teaching quantitative courses are far more likely to be not hired, fired, not re-appointed, not promoted, not tenured, denied merit pay, and denied teaching awards.

Of course, the simple fact that professors teaching quantitative courses receive lower SET ratings than professors teaching non-quantitative courses is not, by itself, evidence that SET are biased. It may be that professors teaching quantitative courses are simply less competent, less effective teachers. However, as pointed out by Uttl and Smibert (2017), this incompetence explanation is unlikely. A wealth of evidence strongly suggests that the lower ratings of professors teaching quantitative courses are due to factors unrelated to the professors themselves. First, the mathematical knowledge and numeracy abilities of populations worldwide have decreased over the years. For example, half of Canadians now score below the level required to fully participate in today's society (Orpwood & Brown, 2015). Second, Uttl et al. (2013) found that fewer than 10 out of 340 undergraduate students were "very interested" in taking any one of the three statistics courses offered in the psychology department at Mount Royal University. In contrast, 159 out of 340 were "very interested" in taking the Introduction to the Psychology of Abnormal Behavior. Thus, professors teaching statistics classes and professors teaching abnormal psychology face students who differ vastly on one of the best predictors of student learning: interest in the subject.

#### **2.3.4 Class Size**

Armchair theorizing suggests that class size (i.e., the number of enrolled students) ought to be inversely related to SET ratings. Small classes, with 10, 20, or even 30 students, give each student a far greater opportunity to interact with the professor. In contrast, in classes beyond 20 or 30 students, professors are unlikely to learn even students' names. Surprisingly, in the first meta-analysis of the SET/class size relationship, Feldman (1984) concluded that the average SET/class size correlation was only *r* = −0.09 (corresponding to *d* = −0.18). Fifteen years later, Aleamoni (1999) summarily declared the notion that class size can affect student ratings to be a myth. Another 10 years later, Gravestock and Gregor-Greenleaf (2008) concluded that "the correlation between class size and ratings is statistically insignificant and is therefore not viewed as having any impact on validity."
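The two figures Feldman reports are linked by the standard conversion between a correlation and Cohen's *d* (shown here as a check on the arithmetic, assuming the usual two-group formula):

```latex
d = \frac{2r}{\sqrt{1 - r^{2}}} = \frac{2(-0.09)}{\sqrt{1 - (-0.09)^{2}}} \approx -0.18
```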

Our review (Uttl et al., 2018) of over 100 studies that examined the relationship between SET and class size, including those reviewed by Feldman (1984), revealed that the vast majority of these studies did not report sufficient information to interpret their findings. For example, many studies did not report the smallest class size, did not report the largest class size, did not report the number of classes within each class size category, did not examine the linearity of SET/class size relationship, did not examine whether there was a decline in SET for classes with fewer than 20 or 30 students, did not show scatterplots of SET/class size relationships, had very small sample sizes, included extreme outliers, etc.

When only those studies that reported sufficient data to plot the relationship between SET and class size are examined, the decline in SET is initially steep and then levels off for class sizes between 30 and 50 students. The overall decline is about 0.5 points on the 1–5 rating scale. When each study's data are standardized, using the smallest class size group in each study as the reference group and the average standard deviation of SET means within each study, the declines in SET ratings for class sizes up to 30 or 50 students amount to about 0.5 standard deviations, and the declines continue thereafter at a much lower rate. Accordingly, disregarding the uninterpretable studies, the evidence clearly shows that SET ratings decline steeply as class size increases to 30–50 students, and that the decline levels off thereafter.
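The standardization just described can be written compactly as follows (our notation, sketching the procedure reported by Uttl et al., 2018):

```latex
d_{js} = \frac{\bar{y}_{js} - \bar{y}_{s,\mathrm{ref}}}{\bar{\sigma}_{s}}
```

where $\bar{y}_{js}$ is the mean SET rating of class-size group $j$ in study $s$, $\bar{y}_{s,\mathrm{ref}}$ is the mean SET rating of the smallest class-size group in study $s$, and $\bar{\sigma}_{s}$ is the average standard deviation of SET means within study $s$.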

# **3 SET Are Influenced by Student Preference Factors (SPFs) Whose Consideration Violates Human Rights Legislation**

A substantial body of research has also reported that SET are influenced by factors whose consideration in high-stakes personnel decisions violates human rights legislation, such as professors' accent, nationality, ethnicity, race, age, gender, etc.

# *3.1 Attractiveness/Hotness*

Do students prefer attractive/hot young professors to unattractive/not-so-hot professors? Using the www.ratemyprofessor.com rating data for 6,852 US faculty, Felton et al. (2008) found that Quality (the average of Clarity and Helpfulness ratings) was strongly correlated with instructor Hotness (www.ratemyprofessor.com discontinued the Hotness scale in 2018 in response to a social media campaign against it), *r* = 0.64. Hotness was similarly correlated with Helpfulness, *r* = 0.64, and Clarity, *r* = 0.60, and only moderately correlated with Easiness, *r* = 0.39. Accordingly, attractive/hot professors receive much higher ratings on Clarity and Helpfulness as well as on Easiness. One may argue that www.ratemyprofessor.com provides low-quality data, unlike carefully designed SET. However, this argument fails for two reasons. First, www.ratemyprofessor.com Overall Quality ratings correlate highly with in-class instructor SET ratings, with *r*s ranging from 0.66 to 0.69 (Coladarci & Kornfield, 2007; Sonntag et al., 2009; Timmerman, 2008). Second, www.ratemyprofessor.com ratings are affected by various TEIFs, SPFs, and DSLFs just as SET are.

# *3.2 Accent/Ethnicity/Nationality*

In one of the most extensive studies, Subtirelu (2015) examined the www.ratemyprofessor.com ratings of 2,192 professors with US last names vs. professors with Chinese or Korean last names teaching in the USA. Subtirelu found that professors with US last names received ratings 0.60–0.80 points higher (on a 5-point scale) on Clarity and 0.16–0.40 points higher on Helpfulness.

# *3.3 Gender*

Hundreds of studies have examined gender differences in SET ratings. In general, gender differences in SET ratings are (a) minimal and (b) inconsistent. Moreover, most of the research has compared the SET ratings of men vs. women within a university, faculty, or department. Such studies are impossible to interpret because the presence or absence of gender differences does not indicate the presence or absence of gender bias. Gender differences could arise, be reduced, or even be masked by a number of factors including, but not limited to, gender differences in teaching ability, in the ability to satisfy students, in the courses taught by men vs. women (quantitative vs. non-quantitative, nursing vs. computer science), and in the ability to bake tasty treats for students (see below). Nevertheless, three recent studies have claimed to show a large bias against female professors and have been widely cited for this claim: Boring (2015, 2017), MacNell et al. (2015), and Mitchell and Martin (2018). A detailed review of these studies does not support their authors' conclusions, as detailed below.

Boring (2015, 2017) examined gender differences in SET ratings using a French university's SET ratings of 372 fixed-contract instructors teaching seminar sections of introductory courses. Boring found that male teachers received slightly higher ratings than female teachers, mainly because male students rated male teachers somewhat higher than female students did (3.20 vs. 3.06, corresponding to approximately 0.2 SD). A re-analysis of Boring's (2015) data set by Boring et al. (2016) shows that the SET/instructor gender correlation was only 0.09, corresponding to approximately 0.2 SD. Accordingly, the Boring et al. data suggest that gender differences are small rather than large. Moreover, Boring's data set does not allow the conclusion that these relatively small differences in SET ratings are evidence of bias against female teachers, for at least the following reasons. First, the students were not randomly assigned to seminar sections; for example, students selected whether they took early-morning, mid-morning, noon, mid-afternoon, or late-afternoon sections. Second, the students knew the grades given to them by their teachers before they completed the SET. Third, there were substantial differences in the experience of female vs. male teachers; for example, a much larger proportion of male teachers had expertise in the field, whereas a much larger proportion of female teachers were still PhD students. These experience differences alone could explain the small differences in SET ratings. Fourth, the seminar section teachers were free to teach their sections whichever way they liked, used different assignments, etc., and thus it is impossible to attribute the small differences in ratings to bias.

MacNell et al. (2015) examined the SET ratings of one female and one male instructor of an online course when students were either truthfully told the gender of each instructor (True Gender) or misled about the instructors' genders, being told that each instructor's gender was the opposite of what it was (False Gender). Both instructors interacted with their students exclusively online, through discussion boards and emails; graded students' work at the same time; used the same grading rubrics; and coordinated their grading to ensure that grading was equitable across their sections. Based on the results of their experiment, MacNell et al. concluded that "Students rated the male identity significantly higher than the female identity, regardless of the instructor's actual gender." However, MacNell et al.'s data suffer from several fundamental flaws that render them uninterpretable and MacNell et al.'s conclusions unwarranted (Uttl & Violo, 2021). First, MacNell et al.'s sample of students in each of the four conditions was extremely small, ranging from 8 to 12 students. Second, MacNell et al.'s conclusions depend on three outliers in their small data set: three students who gave their instructors the lowest possible rating on all or nearly all items. When the three outliers are removed from the data set, students rated the actual female instructor numerically higher than the actual male instructor, regardless of whether the students were given the actual or the false gender of the instructors. Third, MacNell et al.'s study included only one female and one male instructor. It is unwarranted to draw inferences from this small sample in one study to how students rate female vs. male instructors in general.

Similarly, Mitchell and Martin (2018) examined the SET ratings of one female (Mitchell) and one male (Martin) professor teaching different sections of the same online course and found that "a male instructor administering an identical course as a female instructor receives higher ordinal scores in teaching evaluations." Mitchell and Martin argued that their findings were evidence of gender bias as "the only difference in the courses was the identity of the instructor." However, the sections differed, or may have differed, in many respects: (a) students' work was graded by different graders whose strictness varied; (b) Drs. Mitchell and Martin held their own face-to-face office hours; (c) Drs. Mitchell and Martin may have had different email styles; (d) Mitchell's ratings were based on approximately three times as many responses as Martin's ratings; (e) Mitchell and Martin may have taught at different times of the day; etc. Moreover, Mitchell and Martin's argument that the questions in the Instructor/Course, Course, and Technology categories related to "characteristics that are specific to the course" and do not vary across the sections is simply incorrect. The questions in these categories asked, for example, what "the instructor" did, and it ought to be self-evident that different instructors may do things differently; thus, differences in ratings need not reflect gender bias. Finally, and importantly, just as with the MacNell et al. (2015) study, one ought not to make sweeping conclusions about how two categories differ based on differences between two exemplars, one drawn from each category. This "sample size of one" type of research is unlikely to describe what the two populations are actually like.

# **4 SET Are Influenced by Chocolates, Course Easiness, and Other Incentives**

SET ratings are also influenced by numerous factors whose consideration in the evaluation of faculty is ill-advised or detrimental to student learning, including course difficulty; the distribution of chocolates, cookies, and tasty baked goods; and the non-enforcement of course policies, including academic dishonesty and student code of conduct policies.

# *4.1 Course Difficulty*

Using data for 3,190 professors from US universities, Felton et al. (2004) found a moderately strong correlation between Quality and Easiness of 0.61. Moreover, the Quality/Easiness relationship became stronger as more ratings were available for each faculty member: whereas for professors with 10–19 ratings the Quality/Easiness correlation was 0.61, the correlation reached 0.76 for faculty with 50–59 ratings. The moderate-to-strong relationship between Quality and Easiness has subsequently been replicated by a number of studies, including Felton et al. (2008), Rosen (2018), and Wallisch and Cachia (2019). Wallisch and Cachia (2019) confirmed a steep and accelerating decline of www.ratemyprofessor.com Overall Quality ratings (rated on a 1–5 scale) with increasing Course Difficulty (the reverse of Easiness, rated on a 1–5 scale). For each 1.0 point increase in Difficulty, Overall Quality ratings decreased by approximately 0.6 points.

# *4.2 Chocolates and Cookies*

Two randomized studies demonstrate the power of chocolates and cookies in improving SET ratings. In one of the earlier randomized studies, Youmans and Jee (2007) examined whether providing small chocolate bars would result in higher SET ratings in two statistics classes and one research methods class. Students who were offered chocolate bars rated their instructor substantially higher than students who were not (*d* = 0.33). In another randomized study, Hessler et al. (2018) conducted a single-center randomized controlled trial to determine whether the availability of chocolate cookies affects SET ratings. Relative to the no-cookie groups, the cookie groups rated the teachers as well as the course material much higher, *d* = 0.68 and *d* = 0.66, respectively. Accordingly, at a minimum, chocolates and chocolate cookies are both very effective ways to increase one's SET ratings.
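The effect sizes quoted throughout this chapter are standardized mean differences (Cohen's *d*); for the cookie experiment, for instance, the generic formula is (a sketch; Hessler et al. may have used a different variance estimator):

```latex
d = \frac{\bar{x}_{\text{cookie}} - \bar{x}_{\text{no-cookie}}}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_{1}-1)s_{1}^{2} + (n_{2}-1)s_{2}^{2}}{n_{1}+n_{2}-2}}
```

so *d* = 0.68 means the cookie group's mean teacher rating was about two-thirds of a pooled standard deviation above the no-cookie group's.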

# **5 SET Findings Vary with Conflict of Interest**

Uttl et al. (2019) have recently shown that the correlations between SET and learning/achievement in the multisection studies discussed above depend not only on their sample sizes but also on their authors' degree of conflict of interest (perceived or actual). Figure 3, Panel A, shows that authors with ties to SET corporations (Corp) reported much higher SET/learning correlations than authors with no such ties, *r* = 0.58 vs. *r* = 0.18, respectively. However, as shown in Panel B, conflict of interest is not limited to authors with direct financial gains from selling SET but extends to authors with non-financial conflicts of interest, such as ties to administration (Admin) and evaluation units (Eval U). These findings are particularly troubling; they suggest that in addition to the poor methodology employed by many SET studies (e.g., small sample sizes, insufficient method descriptions, failure to consider outliers), many SET research findings may also be the result of their authors' financial and other interests, whether these biases were conscious or unconscious.

# **6 Discussion**

SET do not measure teaching effectiveness and students do not learn more from more highly rated professors. Until recently, meta-analyses of multisection studies have been cited as the best evidence of SET validity. Those meta-analyses, however, were fundamentally flawed. The re-analyses of the previous meta-analyses as well as the new updated meta-analyses of multisection studies show that SET are unrelated to student learning in multisection designs. Accordingly, SET ought not to be used to measure faculty's teaching effectiveness.

**Fig. 3** SET/learning correlations and conflict of interest. Panel **A** shows that authors with ties to SET corporations (Corp) reported much higher SET/learning correlations than authors with no such corporate ties. Panel **B** shows that authors with other conflicts of interest, including ties to administration (Admin) and evaluation units (Eval U), also reported higher SET/learning correlations, whereas authors with no identifiable conflicts of interest reported near-zero SET/learning correlations (adapted from Uttl et al., 2019)

Regardless of what SET actually measure, SET are substantially influenced by (1) numerous factors not attributable to professors, including students' intelligence and prior knowledge, students' motivation and interest, class size, and course subject; (2) factors attributable to professors but whose consideration in high-stakes personnel decisions violates human rights legislation such as accent, race, ethnicity, national origin, age, and hotness/sexiness; and (3) factors attributable to professors but whose consideration is at minimum unwise and/or detrimental to student learning, including course difficulty and availability of chocolates and cookies.

Although some SET systems attempt to adjust for the influence of various TEIFs, SPFs, and DSLFs, these attempts are ultimately futile because no SET system does, or even can, adjust for all demonstrated effects of TEIFs, SPFs, and DSLFs, let alone for the effects of all possible TEIFs, SPFs, and DSLFs. Even adjusting only for the factors reviewed above would likely be impossible. For example, to adjust for factors attributable to students, one would have to administer highly reliable and valid tests of student intelligence, prior knowledge, motivation, interest, racism, accent preferences, hotness preferences, etc.; then calculate class averages of intelligence, prior knowledge, motivation, interest, racism, accent preferences, hotness preferences, etc.; and then develop some adjustment system. No one has done this so far, and no one is likely to do so in the foreseeable future.

SET measure student satisfaction, that is, "a fulfillment of need or want" or "a happy or pleased feeling because of something that you did or something that happened to you" (www.m-w.com). One may argue that student satisfaction is important and that it is properly used, or ought to be used, in high-stakes personnel decisions such as hiring, firing, promotion, merit pay, and teaching awards. However, the fundamental problem with using student satisfaction to evaluate faculty at all is that it depends on factors not attributable to professors.

Moreover, making high-stakes personnel decisions by comparing faculty's SET ratings to university, faculty, or departmental norms sets up and fuels a race among faculty members to beat at least 30, 50, or 70% of their colleagues, depending on the particular norm-referenced criteria for "unsatisfactory," "in need of improvement," etc. adopted by their institution. This race for ever higher SET ratings is what a number of writers believe to be the principal cause of runaway grade inflation and work deflation (Crumbley & Reichelt, 2009; Emery et al., 2003; Haskell, 1997; Stroebe, 2016, 2020). Although SET were relatively rare prior to the 1970s, today they are used by almost all colleges and universities in North America and in many other countries to evaluate teaching effectiveness (Seldin, 1993). Accordingly, the race for higher SET pressures faculty to satisfy their students' needs and wants, in particular, to increase grades, reduce workloads, tolerate academic dishonesty, avoid topics that may antagonize some students, etc. Indeed, grades have been rising: C grades were the most frequently awarded grades in the 1970s, whereas A grades are the most frequently awarded grades today (Rojstaczer & Healy, 2010, 2012). At the same time, students report spending less and less time on their studies (Fosnacht et al., 2018; Rojstaczer & Healy, 2010). Whereas in the 1960s students in the US spent on average about two hours studying outside of class for each hour in class, today they spend only about one hour.

These two trends are nothing short of astonishing when one considers that the average intelligence and ability of students entering colleges and universities has declined over the last 50–100 years, as the proportion of high school graduates entering universities and colleges has increased from approximately 5% to more than 50% or even 70%, depending on the country, state, and province (US Census Bureau, 2019). Notably, SET are not the only cause of grade inflation and work deflation. Other related causes include colleges' and universities' focus on high student retention; pressure on professors to limit the percentages of D, F, and W (withdrawal) grades (explicitly requiring professors to increase grades); and a business culture that strives not only for happy customers whose needs and wants must be satisfied but also for as many customers as possible.

In the first public legal case of its kind, Ryerson University was forbidden from using SET as a measure of teaching effectiveness (Ryerson University v. Ryerson Faculty Association, 2018 CanLII 58446, available at www.canlii.org). The arbitrator, Williams, stated:

That evidence, as earlier noted, was virtually uncontradicted. It establishes, with little ambiguity, that a key tool in assessing teaching effectiveness is flawed, while the use of averages is fundamentally and irreparably flawed. It bears repeating: the expert evidence called by the Association was not challenged in any legally or factually significant way. As set out above, the assessment of teaching effectiveness is critical, for faculty and the University, and it has to be done right. The ubiquity of the [SET] tool is not a justification, in light of the evidence about its potential impact, for its continuation, or for mere tinkering.

SET ratings also run afoul of at least some codes of ethics. For example, the Canadian Code of Ethics for Psychologists (Canadian Psychological Association, 2017) makes it clear that psychologists not only have a duty not to participate in incompetent and unethical behavior, such as evaluating their colleagues using invalid and biased SET tools, but also have a responsibility to call out "incompetent and unethical behavior, including misinterpretations or misuses of psychological knowledge and techniques" (Ethical Standard IV.13).

Notwithstanding the above criticisms, student surveys may continue to be useful for formative purposes, that is, for improving instruction, when professors themselves design or select questions relevant to their teaching methods and courses, and when the results are provided only to the professors themselves. This would ensure that the surveys are not misused for summative purposes but are used only formatively, or to raise alarm about clearly ineffective teaching behaviors (e.g., not showing up for one's classes).

Finally, and importantly, this review of SET research highlights the need for transparent, replicable, and methodologically strong research, conducted by researchers with no conflict of interest and no stake in particular findings. The SET literature is replete with unsubstantiated and contradictory findings based on poor methods. As detailed above, Cohen's (1981) widely cited evidence of SET validity turned out to be an artifact of poor methods and a failure to take into account small sample size bias and students' prior learning/ability. Similarly, Feldman's (1984) finding of a minimal effect of class size, and Aleamoni's (1999) later dismissal of the idea that class size is related to SET ratings as a myth, were based on poor methods and a failure to adequately review the previous findings. And MacNell et al.'s (2015) claim of gender bias against women hinges in its entirety on three outliers: three students who disliked their instructors so much as to give them the lowest possible rating on all or nearly all items. Significantly, as shown by Uttl et al. (2019), reported findings may also be greatly influenced by conflicts of interest. It is clear that any review of this literature needs to be approached with the attitude of a detective, rather than simply accepting what is written in studies' abstracts, in order to separate true findings supported by evidence from uninterpretable and unwarranted claims.

In conclusion, the continued use of SET in high-stakes personnel decisions such as hiring, firing, promotion, merit pay, and teaching awards is not evidence based. The evidence is that (a) students do not learn more from more highly rated professors; (b) SET are biased by a variety of factors not attributable to professors; (c) SET run afoul of human rights legislation; and (d) SET are easily manipulated by small chocolates such as Hershey's Kisses, course easiness, and other factors. In short, SET do not measure faculty's teaching effectiveness, and their use in high-stakes personnel decisions is improper, unethical, and ought to be discontinued immediately.

# **References**


**Bob Uttl** is a Professor of Psychology at Mount Royal University, Calgary (Canada). Previously, he held a variety of academic posts around the world, including at Red Deer College, Canada; Tamagawa University, Japan; University of Tsukuba, Japan; Oregon State University, USA; the Henry M. Jackson Foundation for the Advancement of Military Medicine, USA; and the National Institutes of Health, USA. His research includes the relationship between perception, processing resources, and prospective and retrospective memory; changes in cognitive functions due to normal and pathological aging; measurement and research methods; and the evaluation of teaching in higher education.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Part IV Discussion and Future Directions**

# **Chapter 16 Student Feedback on Teaching in Schools: Current State of Research and Future Perspectives**

## **Wolfram Rollett, Hannah Bijlsma, and Sebastian Röhl**

**Abstract** The aim of this volume was to give a comprehensive overview of the current state of the research on student perceptions of and student feedback on teaching. This chapter provides a summary of the important theoretical considerations and empirical evidence the authors contributed to this volume. First, evidence concerning the validity of student perceptions of teaching quality is discussed, highlighting the quality of the questionnaires used and the accompanying materials provided by their authors. Next, empirical findings are summarized on student and teacher characteristics that can influence important processes within the feedback cycle. Subsequently, it is emphasized that the effectiveness of student feedback on teaching is significantly related to the nature of the individual school's feedback culture. Furthermore, it is argued that the efficacy of student feedback depends on whether teachers are provided with a high level of support when making use of the feedback information to improve their teaching practices. As the literature review impressively documents, teachers, teaching, and ultimately students can benefit substantially from student feedback on teaching in schools.

**Keywords** Student perception · Student feedback · Teaching quality · Teaching improvement · Validity

W. Rollett · S. Röhl (B) Institute for Educational Sciences, University of Education, Freiburg, Germany e-mail: sebastian.roehl@ph-freiburg.de

W. Rollett e-mail: wolfram.rollett@ph-freiburg.de

H. Bijlsma Section of Teacher Professionalization, University of Twente, Enschede, the Netherlands e-mail: h.j.e.bijlsma@utwente.nl

# **1 Introduction**

Although there exists a vast and differentiated literature about teachers' feedback to students in schools and the ways to make productive use of it (Hattie, 2009), feedback from students to teachers has received far less attention. The aim of this volume, therefore, was to present an informative overview of state-of-the-art research in this area and important neighboring scientific fields. Central topics discussed in this volume are whether student perceptions of teaching in school are reliable and valid, what has to be considered to obtain valid information, and how to successfully make use of it for the professional development of teaching and teachers.

As Hattie points out in his foreword to this volume, knowledge of the variables which may influence the success and effectiveness of feedback is critical. The Process Model of Student Feedback on Teaching (SFT) suggested by Röhl, Bijlsma, and Rollett (Chap. 1 of this volume) is an attempt to provide a framework which describes the feedback cycle in such a way that it can orient research on the efficacy of student feedback as well as the effective implementation of intervention measures. Particular focus is put on variables which characterize the affective and cognitive processing of students' feedback by teachers and their readiness to consider improvement-oriented actions. A professional implementation of student feedback on teaching clearly has the potential to enrich the feedback and learning culture of schools substantially and, above and beyond that, can contribute to their democratic culture. A corresponding approach is elaborated by Jones and Hall (Chap. 13 of this volume), who advocate school and teaching practices involving the "student voice," i.e., involving students in the planning and implementation of their own education. But, as Uttl (Chap. 15 of this volume) summarizes, findings from higher education show the potential dangers that arise when student perceptions of teaching are collected with an evaluative focus.

In this final chapter, we summarize the findings and conclusions drawn from the chapters in this volume to give an overview of what has been achieved in research on student feedback, what needs to be considered when implementing student feedback in practice, and where we see room for improvement. First, we discuss the validity of student perceptions of teaching quality and the characteristics of survey instruments. Next, we highlight characteristics of students and teachers affecting the feedback process. We then discuss the organizational context of the evaluation and the presentation of the feedback information to stakeholders. Finally, we suggest directions forward for researchers, policymakers, and schools.

# **2 Validity of Student Perceptions of Teaching Quality and Characteristics of Survey Instruments**

Regarding the question of the validity of measurements and tests in educational contexts, it has been emphasized that validity can only be assessed in regard to the intended interpretation and subsequent actions (AERA, 2014; Kane, 2012). In this sense, we discuss the validity of student perceptions of teaching in terms of their value for improving teaching practices in a formative setting and, at the same time, we disregard a purely evaluative use of student ratings of teaching (see Chap. 8 by Wisniewski and Zierer, and Chap. 15 by Uttl in this volume).

The literature review provided across the chapters of this volume impressively illustrates how teachers and teaching can benefit from making use of formative student feedback. Nevertheless, there are still many researchers and practitioners raising concerns about the accuracy and fairness of student ratings of teaching quality, which—if tenable—would considerably limit their value for the proposed usage in the development of teaching and teaching skills. Indeed, there are good reasons to be skeptical about the results of student ratings of teaching used in the field, and several contributions to this volume address the topic of a valid measurement of student perception of teaching quality.

An important issue in this context is how well students are able to evaluate teaching practices. The literature on the prognostic validity of student feedback measures points to the conclusion that student evaluations of teaching do indeed capture aspects of teaching quality which are relevant for students' learning and development (e.g., Fauth et al., 2014; Praetorius et al., 2018; Wallace et al., 2016). As the analyses presented in this volume indicate, there is much that can be done to improve the measurement procedures and to increase the accuracy of student ratings. For example, Bijlsma et al. (Chap. 2 of this volume) point out that the underlying psychometric theory determines how the rater's perception is conceptualized and captured. Göllner, Fauth, and Wagner (Chap. 7 of this volume) emphasize that we have to be cautious and more aware of the way we ask students about their experiences in class. Different combinations of item referents (e.g., "I / We / The class understood the subject matter well") and item addressees (e.g., "The teacher explained the subject matter clearly to me / the class") are likely to induce different evaluation processes and different results, thus affecting the reliability and validity of the measurements. Accordingly, Schweig and Martínez (Chap. 6 of this volume) call for evaluating within-classroom variability of student experiences as an indicator of disparate instructional experiences and unequal participation opportunities for students. The authors strongly argue that evaluating within-classroom variability should be considered a defining strength of the approach of using student-survey-based measures for the improvement of teaching.

One intensely discussed topic in the literature is the agreement or disagreement between the evaluations of students and observers (e.g., Clausen, 2002; Gitomer et al., 2014; Kuhfeld, 2017). van der Lans (Chap. 5 of this volume) provides important findings which may even have the potential to end this discussion. In his analyses, the results from students and observers converge when 25 students' and seven different observers' views are related to each other. Interestingly, the ordering of the item difficulties, or teaching competences, was very consistent across student and observer ratings, and could also be calibrated on the same continuum of instructional effectiveness (van der Lans et al., 2019). These findings indicate that the disagreement between students and observers often reported in the literature may be largely attributable to an insufficient number of observers.

But it is indisputable that the question has to be raised of how well students perceive different aspects of teaching quality, and how well they comprehend the corresponding items in the questionnaire they are processing. Unfortunately, little research has been done on these topics. Accordingly, Göllner and colleagues (Chap. 7 of this volume) call for studies on students' cognitive processing of survey items and its influence on their evaluation of teaching practices, while highlighting the necessity of age- and development-appropriate survey instruments. A closer look at how well students comprehend the items in a questionnaire would improve the survey instruments substantially and subsequently enhance the validity of the feedback information. For example, Bijlsma et al. (under review) and Lenske (2016) discussed the content of the items of their student perception questionnaires intensively with students to make sure that the items were well understood and interpreted.

In their reviews of the literature, both Göllner et al. (Chap. 7 of this volume) and Röhl and Rollett (Chap. 3 of this volume) raise the further questions of (1) whether students in school are actually able to distinguish between different teaching dimensions and (2) how reliable the ratings of different dimensions are. Students' lack of ability to differentiate between teaching dimensions would lead to empirically simpler factorial structures (see also Kuhfeld, 2017). Indeed, it is quite typical to see a two-factor structure, in which a general factor covers all theoretically distinguishable teaching quality constructs with the exception of classroom management (e.g., Wallace et al., 2016). As Röhl and Rollett (Chap. 3 of this volume) demonstrate, students' social perceptions of their teachers explain an important part of the common variance of different teaching quality dimensions in a second-order factor model. These results indicate that students' evaluations of teaching quality might be influenced by their social perceptions of their teachers and so may lead to biased assessments. Their findings suggest emphasizing items which are less likely to be confounded by the students' social perception of their teachers (e.g., by addressing individual experiences in a specific lesson); in addition, the impact of students' social perceptions of their teachers can be counteracted by administering suitable scales. The literature, nevertheless, offers reliable information that students' assessments of teaching quality dimensions show characteristics of differential predictive validity, indicating the existence of meaningful unique variance (e.g., Klieme & Rakoczy, 2003; Raudenbush & Jean, 2014; Yi & Lee, 2017).

In order to make use of student perceptions of teaching quality, the design of the survey instruments is crucial and critically determines the nature of the perceptions captured. In her informative systematic review, based on an extensive literature search, Bijlsma (Chap. 4 of this volume) analyzes the quality of 22 student perception questionnaires on teaching quality. Overall, most of the instruments were evaluated positively regarding their theoretical foundation, their design, and the information about their statistical quality. The review revealed, however, weaknesses concerning norm information, sampling specifications, and the availability of more detailed information on the features of the instruments (e.g., in the form of a user manual). The analyses illustrate the need for authors to put more emphasis on the quality of the presentation of their survey instruments to potential users.

As Röhl's review of the research (Chap. 9 of this volume) shows, there is a substantial amount of evidence for the effects of student feedback on teachers' behavior, e.g., initiating reflective thinking processes, learning about students' perspectives, reviewing their goal setting, and changing their teaching practices accordingly. When provided with student ratings or feedback on their teaching, teachers also tend to engage more in communication with their classes on teaching practices and the changes which follow student feedback. An important pattern of results from a meta-analysis of intervention studies in schools presented by Röhl (ibid.) shows a mean weighted effect size of *d* = 0.21 for the impact of student feedback on students' perception of the subsequent lessons. But, as an in-depth analysis showed, this effect size underestimates the potential of student feedback: a high level of support provided to the teachers when making use of feedback information yielded a significantly larger positive effect of *d* = 0.52, whereas medium or low levels of support did not result in a better outcome. This pattern of findings, therefore, highlights how crucial an adequate level of support is for the effectiveness of student feedback measures.
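
To make the reported statistic concrete: a mean weighted effect size of this kind is typically computed as an inverse-variance weighted average of the study-level effects. A minimal sketch in Python, using made-up study values rather than the actual data from Chap. 9:

```python
import numpy as np

def weighted_mean_effect(d: np.ndarray, v: np.ndarray) -> tuple[float, float]:
    """Fixed-effect inverse-variance weighted mean effect size.

    d: study-level standardized mean differences (Cohen's d)
    v: corresponding sampling variances
    Returns the pooled effect and its standard error.
    """
    w = 1.0 / v                          # inverse-variance weights
    d_bar = np.sum(w * d) / np.sum(w)    # weighted mean effect
    se = np.sqrt(1.0 / np.sum(w))        # standard error of the pooled effect
    return d_bar, se

# Hypothetical studies (effects and variances are illustrative only):
d = np.array([0.10, 0.25, 0.35, 0.15])
v = np.array([0.02, 0.04, 0.03, 0.05])
pooled, se = weighted_mean_effect(d, v)
print(f"pooled d = {pooled:.2f} (SE = {se:.2f})")
```

A moderator analysis like the support-level comparison reported by Röhl would then pool the high-support and the medium- or low-support studies separately and compare the two estimates.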

# **3 Student and Teacher Characteristics Influencing the Feedback Process**

As the research presented in this volume shows, student feedback on teaching can indeed provide a valuable basis for evaluating and improving teaching practices. It is not unusual for school students to welcome and value the opportunity to give teachers feedback regarding their teaching and to find their "student voice" recognized (Jones and Hall, Chap. 13 of this volume). Nevertheless, student ratings of teaching can be affected by a variety of student and class characteristics (Bijlsma et al., under review). For example, high-performing students rate their teachers' teaching quality significantly higher than low- and middle-performing students. Students from socioeconomically or educationally more privileged families tend to be more critical of teaching practices (Atlay et al., 2019). Male students seem to be more critical than female students (Kuhfeld, 2017). Moreover, differences in students' language comprehension can affect whether items in the survey instrument are understood (Lenske, 2016). The perception or evaluation of teaching quality may also differ by age or developmental stage (see Chap. 7 of this volume by Göllner et al.). Student ratings of teaching can furthermore be influenced by certain individual teacher characteristics which are not systematically associated with differences in teaching performance—such as gender, age, or physical appearance; in more sophisticated evaluation contexts, procedures can be implemented to correct scores accordingly, but this is not typical. The expectation of erroneous results can be considered the most frequent reason why teachers are reluctant to use feedback. However, as Schweig and Martínez (Chap. 6 of this volume) conclude, "these biases are generally small in magnitude and do not greatly influence comparisons across teachers or student groups, or how aggregates relate with one another and with external variables." Users should, of course, still be aware of the biases which might occur, especially as minor differences may have severe consequences in evaluative contexts.

Measures to counteract these undesirable effects and to prepare students to use the questionnaires appropriately, thereby developing students' feedback competences, are indeed advisable. Accordingly, Göbel et al. (Chap. 11 of this volume) call for training students to use the survey instruments adequately. Unfortunately, the authors of these survey instruments frequently do not provide users with clear guidelines for implementing the instruments (Bijlsma, Chap. 4 of this volume).

It is well documented that individual characteristics of teachers influence whether and how effectively they use student feedback measures to improve their teaching and teaching skills, as Röhl and Gärtner (Chap. 10 of this volume) show in their literature review. In their discussion, the authors particularly emphasize teachers' attitudes toward students as feedback providers (e.g., regarding their trustworthiness or competence) and whether teachers perceive the function of the feedback as an opportunity to develop their teaching. The effectiveness of student feedback is also influenced by teachers' attitudes toward the feedback measurement process. In general, teachers tend to show a positive attitude toward formative forms of student feedback on teaching (e.g., Göbel et al., Chap. 11 of this volume). But it is not uncommon for its accuracy and trustworthiness to be questioned, especially when the feedback comes from younger students. Pre-service teachers, on the other hand, tend to be more positive about using student feedback on teaching than in-service teachers, as the findings of Göbel et al. (ibid.) indicate. In their insightful investigation, they demonstrate that experiences with student feedback can have a further positive impact on pre-service teachers' attitudes concerning student feedback in general and on their willingness to reflect on and modify teaching practices. These results thereby illustrate the potential of a widespread implementation of student feedback on teaching within the practical parts of teacher education.

As the results in this volume show, the ways in which teachers perceive, process, and make use of the feedback information determine its impact on their teaching. Accordingly, the SFT Model (see Chap. 1 of this volume by Röhl et al.) puts an emphasis on teachers' processing and handling of feedback information. At present, research on the ways in which teachers perceive and interpret feedback information affectively and/or cognitively, how this influences them, and how they deal with it is still rather scarce. But as the investigations of Röhl and Rollett (2021) show, teachers can differ greatly in why and how they make use of student feedback on teaching. Their analysis identified four paths of utilization: (1) Direct Formative Use (identifying aspects to improve, setting goals, evaluating target achievement); (2) Direct Communicative Use (discussing the results and looking for improvements in class); (3) Indirect Use (enabling a positive emotional experience, gathering of information); and (4) Symbolic Process Orientated Use (signaling a democratic or student-orientated attitude and an openness to criticism in classes). These results point to the importance of paying more attention to the goals individual teachers pursue when they ask their students for feedback on their teaching.

# **4 Organizational Context of the Evaluation and the Presentation of Feedback Information to Stakeholders**

The conditions of the organizational context in schools can vary considerably, and these differences can influence the effectiveness of student feedback (see Chap. 10 by Röhl and Gärtner; Chap. 11 by Göbel et al.; and Chap. 8 by Wisniewski and Zierer in this volume). The relevance of organizational characteristics for the effects of feedback is also evident from research on multisource feedback in business enterprises (Chap. 14 of this volume by Fleenor). The school setting can provide resources which strongly support teachers within the student feedback cycle, thus fostering its effectiveness (see Chap. 9 of this volume by Röhl). Schools may offer team structures which closely accompany the process of reflecting on the feedback as well as the subsequent professional development and changes in teaching practices. Furthermore, it depends very much on the learning culture within the organization whether the feedback is considered an opportunity to learn and whether sustainable support is provided to act on the results. Correspondingly, Röhl and Gärtner (Chap. 10 of this volume) highlight—in their literature review on the conditions of effectiveness of feedback—the importance of a positive feedback culture, organizational safety, and a focus on the professional development of teachers in contrast to a focus on control. In their approach, the feedback culture is considered a crucial moderator for the effectiveness of feedback measures. Accordingly, school management and leadership have a special responsibility for the success of student feedback measures by ensuring a safe learning environment and shaping a positive feedback culture within schools. Elstad and colleagues (2015, 2017) report a higher appreciation of the results of student feedback on teaching when teachers perceive a developmental purpose, whereas perceiving a control purpose is linked to a rejecting attitude toward the feedback measures and a lower recognition of the feedback information. Furthermore, other findings indicate that the effects of student feedback on teaching practices differ depending on whether teachers are intrinsically or extrinsically motivated to engage with it, although both motivational paths are related to positive changes in the classroom (Gärtner, 2014).

How much the organizational context matters is well illustrated by a project described by van der Lans (Chap. 5 of this volume). It followed a very sophisticated data-driven procedure: (1) determining reliable diagnostic results of individual teaching practices (from ineffective to effective); (2) allocating teachers on an empirically validated continuum of teaching effectiveness; (3) identifying the most effective development measures; and (4) tailoring the feedback procedure accordingly. Taken together, these measures provide a viable basis for teachers' further education and professional development. This aligns with the research field of data-based decision-making, which suggests that data (in this case, student feedback data) can help improve teaching and further outcomes for students (Poortman & Schildkamp, 2016; Schildkamp, 2019; van Geel et al., 2016).

A critical issue for the effectiveness of student feedback is how it is presented to the teachers. Problems concerning the accuracy and comprehensibility of feedback have been addressed for a long time (e.g., Frase & Streshly, 1994). The design of feedback and of support measures therefore has to take into account teachers' level of data literacy in order to overcome typical struggles in making use of the data (e.g., Kippers et al., 2018). One way to reduce the complexity of the gathered student data is to condense the information into a smaller number of performance levels, which makes it considerably easier to communicate individual strengths and weaknesses. However, this requires an adequate level of support to prepare the data accordingly. One should, nevertheless, be careful not to disregard the potentially meaningful variance of the student ratings within classes (see Chap. 6 of this volume by Schweig and Martínez).
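
As a concrete illustration of such condensing, class-level scale means can be mapped onto a small number of performance levels while the within-class spread is reported alongside, so that the variance warning above is respected. A minimal Python sketch; the cut-offs and labels are invented for illustration and not taken from any instrument in this volume:

```python
import pandas as pd

# Hypothetical cut-offs on a 1-5 rating scale (illustrative only).
LEVELS = [(0.0, "needs attention"), (3.0, "developing"), (4.0, "strength")]

def condense(ratings: pd.DataFrame) -> pd.DataFrame:
    """Summarize student ratings per scale into coarse performance levels.

    ratings: one row per student, one column per scale (values 1-5).
    Returns the class mean, the within-class SD, and a level label per
    scale, so the spread of perceptions is not lost in the summary.
    """
    summary = pd.DataFrame({
        "class_mean": ratings.mean(),
        "within_class_sd": ratings.std(),
    })
    summary["level"] = summary["class_mean"].map(
        lambda m: [label for cut, label in LEVELS if m >= cut][-1]
    )
    return summary
```

Reporting the within-class SD next to the level label keeps visible whether a "strength" reflects broad agreement or a polarized class.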

Another important prerequisite for the effectiveness of student feedback is that the communication within the feedback cycle is performed in an appreciative and constructive manner. Yet the ability to formulate and provide feedback, as well as the ability to receive and respond to it, can vary considerably. These differences can significantly influence the cognitive and emotional dynamics within the feedback process and its effectiveness. Effective feedback can be characterized as task orientated, specific, clear, development orientated, and distinct in its implications for action (Cannon & Witherspoon, 2005). Röhl and Gärtner (Chap. 10 of this volume) discuss how characteristics of the feedback may influence its effectiveness in terms of the information format (e.g., means, boxplots), the timing of the feedback, its specificity, valence, and positivity. Unfortunately, there are only a few studies in the field of student feedback on teaching addressing these issues.

# **5 Concluding Remarks**

The present volume is the first to provide a comprehensive overview of the current state of research on student perceptions of and student feedback on teaching in schools. Its aim was to coherently present to a wider audience the extensive and important international research which has been done on using student perceptions of teaching to improve teaching practices. The authors contributing to this volume agree in granting student feedback a high potential for the improvement of teaching in schools. The empirical evidence for this claim, which is addressed across the chapters of this book, is impressive. If set up professionally, the implementation of student feedback on teaching can indeed be a very effective way to improve teaching quality.

But there are, of course, requirements which have to be met to achieve these positive results. On the one hand, a high quality of the survey instruments and of the accompanying material provided by their authors is indispensable. On the other hand, high-quality support within the school setting is needed for gathering and evaluating the data, for interpreting and reflecting on the information, and for putting the results effectively into action. Accordingly, it has been shown that the availability of an adequate level of support is an important moderating variable for the effectiveness of student feedback on teaching (Röhl, Chap. 9; and Göbel et al., Chap. 11 in this volume).

In order to make better use of the potential of student feedback on teaching, different paths should be followed. Authors of student perception questionnaires should put more emphasis on providing users with sound and easily accessible information (concerning, e.g., the theoretical basis, measurement quality, reference norms, and guidelines for implementing and working with the instruments). Practitioners should be encouraged to use student feedback by establishing sustainable support structures within schools, including powerful technical solutions for implementing, evaluating, and acting on student feedback. Researchers should intensify investigations on teachers' ways of processing feedback information and on how a professional and ongoing implementation of student feedback in schools affects the longitudinal development of teachers, of teaching, and, last but not least, of students.

In several respects, the findings presented in this volume indicate that the summative use of student feedback for teacher accountability to supervisors is hardly appropriate. For instance, when teachers perceive a control function of the feedback, they tend to be more resistant to its developmental use (see Chap. 10 of this volume by Röhl and Gärtner). Further, student perceptions of teaching quality are subject to many idiosyncrasies and biasing factors, requiring expert and cautious interpretation (see Chaps. 3 and 7 of this volume by Röhl and Rollett and by Göllner et al.). Therefore, the use of student feedback for accountability purposes should be avoided in order to prevent damaging its developmental potential.

A promising development to support the capturing, evaluating, and scrutinizing of student feedback data is the use of online or smartphone-based survey instruments (e.g., the Impact! tool, Bijlsma et al., 2019; FeedbackSchule, Wisniewski et al., 2020). If set up accordingly, they can provide users with easily accessible, differentiated feedback information at the individual, group, and class level, and might even provide scores corrected for known biases. Digital solutions can be an excellent way to gather and evaluate student feedback on teaching, thus reducing resource needs significantly. Of course, they can in no way substitute for individual teachers' professional reflection on the feedback results within collegial settings, but they can help schools substantially in creating the informational basis for these processes and in making better use of the information and of their typically limited time and staff resources.
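
One simple way such bias correction could work is to residualize class-mean ratings on known bias covariates before reporting them; actual tools may use far more sophisticated models. A sketch of this idea in Python, with hypothetical inputs:

```python
import numpy as np

def bias_adjusted_scores(y: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Adjust class-mean ratings for known bias covariates via OLS.

    y: raw class-mean ratings, shape (n_classes,)
    X: bias covariates per class (e.g., share of high achievers),
       shape (n_classes, n_covariates) -- hypothetical inputs.
    Returns ratings with the covariates' linear contribution removed,
    re-centered on the original grand mean.
    """
    Xc = np.column_stack([np.ones(len(y)), X])     # add intercept column
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)  # ordinary least squares
    residuals = y - Xc @ beta                      # remove predicted bias
    return residuals + y.mean()                    # re-center the scores
```

Given the caution by Schweig and Martínez that such biases are generally small, adjustments of this kind matter mainly in evaluative contexts where minor differences carry weight.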

Another very inspiring perspective for the future of student feedback is provided by Schmidt and Gawrilow (Chap. 12 of this volume), who advocate the implementation of measures for systematic reciprocal feedback between students and teachers, thereby addressing teachers and students as cooperative partners. The potential of combining approaches of feedback from students to teachers with those of feedback from teachers to students provides an important outlook for developments to come.

An important point arising from the research overview presented in this volume is the question of why some countries and regions seem more reluctant than others to use student feedback to improve teaching and to professionalize teachers (see e.g., Chap. 4 by Bijlsma, and Chap. 9 by Röhl). Here, cultural aspects such as the roles of teachers and students, but also characteristics of the school systems, could provide an explanatory framework. These issues should be elaborated on in further research in order to develop appropriate and effective forms of making use of student feedback on teaching in these cultural contexts and countries.

As outlined throughout this book, student feedback on teaching is a highly beneficial and, from our point of view, indispensable way to improve teaching practices. Based on the extensive body of research on the benefits and effectiveness of student feedback on teaching presented in this volume, the authors hope to contribute to a wide and systematic use of student feedback in schools in order to sustainably improve teaching quality and the learning experiences of students.

# **References**


K. Neuber, & M. Raaflaub (Eds.), *Quo vadis Forschung zu Schülerrückmeldungen?* Springer VS. https://doi.org/10.1007/978-3-658-32694-4.


**Wolfram Rollett** is a Professor of Empirical Educational Research at the University of Education Freiburg and the Freiburg Advanced Center of Education (FACE). Previously he worked as a researcher and lecturer in the field of Educational Science and Psychology at the Universities of Potsdam, Braunschweig, Dortmund, and Wuppertal. His research focuses on school development processes, the quality of extra- and co-curricular activities, educational effectiveness, and classroom composition effects.

**Hannah Bijlsma** is a researcher (Ph.D.) at the section of Teacher Professionalization at the University of Twente (the Netherlands) and a primary school teacher (grade 1). Her research focuses on measuring teaching quality and on the use of student perceptions of teaching quality in school contexts. In 2016 she founded a professional association for academic primary school teachers, of which she has been chairman for about five years. She is now a board member of the International Congress for School Effectiveness and Improvement (ICSEI) and a board member of the EARLI SIG on School Effectiveness and Improvement.

**Sebastian Röhl** has been an Academic Assistant in the Department of Educational Science at the University of Education Freiburg and is currently a Postdoctoral Researcher in the Institute of Education at Tübingen University (Germany). Before that he worked for more than 10 years as a grammar school teacher, school development consultant, and in teacher training. Among other areas, he conducts research in the fields of teaching development and teacher professionalization through feedback, social networks in inclusive school classes, as well as teachers' religiosity and its impact on professionalism. In addition, he is the Director of an in-service professional master's study program for teaching and school development.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.